mogita [Wed, 19 May 2021 05:35:15 +0000 (13:35 +0800)]
fix: add the missing question mark
Sarah Hoffmann [Tue, 18 May 2021 21:00:10 +0000 (23:00 +0200)]
Merge pull request #2336 from lonvia/do-not-mask-error-when-loading-tokenizer
Do not hide errors when importing tokenizer
Sarah Hoffmann [Tue, 18 May 2021 20:58:25 +0000 (22:58 +0200)]
Merge pull request #2321 from AntoJvlt/csv-import-special-phrases
CSV import for special phrases and loader refactoring
AntoJvlt [Mon, 17 May 2021 21:00:22 +0000 (23:00 +0200)]
Documentation update and small code fixes
Sarah Hoffmann [Tue, 18 May 2021 14:28:21 +0000 (16:28 +0200)]
do not hide errors when importing tokenizer
Explicitly check for the tokenizer source file to verify that
the name is correct. We can't use the import error for that
because it hides other import errors like a missing
library.
Fixes #2327.
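
The idea in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name load_tokenizer_module, the exception UnknownTokenizerError and the "<name>_tokenizer.py" file naming are assumptions made for the example.

```python
import importlib.util
from pathlib import Path


class UnknownTokenizerError(Exception):
    """Hypothetical error for a tokenizer name that has no source file."""


def load_tokenizer_module(name, directory):
    """Load '<name>_tokenizer.py' from `directory` (assumed naming scheme)."""
    source = Path(directory) / "{}_tokenizer.py".format(name)

    # Check for the file first, so that a wrong name is reported as such.
    if not source.is_file():
        raise UnknownTokenizerError("Tokenizer '{}' not found.".format(name))

    spec = importlib.util.spec_from_file_location(source.stem, str(source))
    module = importlib.util.module_from_spec(spec)
    # Any ImportError raised while executing the module (for example a
    # missing library) now propagates unchanged instead of being hidden.
    spec.loader.exec_module(module)
    return module
```

With this split, a misspelled tokenizer name and a tokenizer whose dependencies are missing produce two different errors.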
Sarah Hoffmann [Tue, 18 May 2021 09:30:58 +0000 (11:30 +0200)]
Merge pull request #2332 from lonvia/fix-keyword-details
Always use object type for details keywords
Sarah Hoffmann [Mon, 17 May 2021 14:36:32 +0000 (16:36 +0200)]
always use object type for details keywords
When name and address are empty, the keywords field in the response
of the details API would be an array because that is what PHP's
json_encode defaults to with empty array(). This default can only
be changed globally per json_encode call and that might cause
unintended collateral damage. Work around the issue by making
name and address an empty array instead of keywords.
Fixes #2329.
AntoJvlt [Mon, 17 May 2021 11:52:35 +0000 (13:52 +0200)]
Resolve conflicts
AntoJvlt [Mon, 17 May 2021 10:53:58 +0000 (12:53 +0200)]
Special phrases documentation updated
AntoJvlt [Mon, 17 May 2021 10:40:50 +0000 (12:40 +0200)]
Added --no-replace command for special phrases importation and added corresponding tests
AntoJvlt [Sun, 16 May 2021 14:59:12 +0000 (16:59 +0200)]
Code cleaning and SPLoader deleted
AntoJvlt [Sun, 16 May 2021 13:32:22 +0000 (15:32 +0200)]
Add tests for the new SPWikiLoader and SPCsvLoader
Sarah Hoffmann [Fri, 14 May 2021 08:40:22 +0000 (10:40 +0200)]
Merge pull request #2323 from darkshredder/disable-search-reverse-only
Feat: Disabled search API for --reverse-only imports
Sarah Hoffmann [Fri, 14 May 2021 07:58:50 +0000 (09:58 +0200)]
Merge pull request #2328 from lonvia/convert-tiger-to-csv
Switch external Tiger data to CSV format
Sarah Hoffmann [Fri, 14 May 2021 07:44:10 +0000 (09:44 +0200)]
install default settings for legacy_icu tokenizer
Sarah Hoffmann [Thu, 13 May 2021 21:39:01 +0000 (23:39 +0200)]
adapt documentation to use Tiger CSV dump
Sarah Hoffmann [Thu, 13 May 2021 21:37:51 +0000 (23:37 +0200)]
adapt tests to new TIGER CSV format
Sarah Hoffmann [Thu, 13 May 2021 20:11:41 +0000 (22:11 +0200)]
use tokenizer during Tiger data import
This also changes the required import format to CSV.
Darkshredder [Wed, 12 May 2021 21:44:37 +0000 (03:14 +0530)]
feat: Added reverse-only-search validation
Sarah Hoffmann [Thu, 13 May 2021 20:09:56 +0000 (22:09 +0200)]
Merge pull request #2326 from lonvia/wokerpool-for-tiger-data
Use WorkerPool when importing Tiger data
Sarah Hoffmann [Thu, 13 May 2021 18:16:30 +0000 (20:16 +0200)]
use WorkerPool for Tiger data import
Requires adding an option that SQL errors are ignored.
Sarah Hoffmann [Thu, 13 May 2021 15:11:17 +0000 (17:11 +0200)]
move WorkerPool into db module
The pool is independent of the indexer and may also be used
by other parts of the software.
Sarah Hoffmann [Thu, 13 May 2021 15:00:29 +0000 (17:00 +0200)]
Merge pull request #2325 from lonvia/do-not-precompute-postcodes
Do not preload postcodes in the legacy tokenizer
Frederik Ramm [Thu, 6 May 2021 18:44:04 +0000 (20:44 +0200)]
Add array_key_last function for PHP <7.3
This patch adds an array_key_last function if it doesn't yet exist. Fixes #2316. It is tested on PHP 7.2.24 but not PHP 7.3.
Sarah Hoffmann [Thu, 13 May 2021 14:14:12 +0000 (16:14 +0200)]
do not preload postcodes
This is too expensive for updates.
Sarah Hoffmann [Thu, 13 May 2021 12:52:19 +0000 (14:52 +0200)]
Merge pull request #2324 from lonvia/generic-external-postcodes
Rework postcode handling and generalised external postcode support
Sarah Hoffmann [Thu, 13 May 2021 12:31:41 +0000 (14:31 +0200)]
fix token_info migration
A bad indent meant that only one table received the new column.
Sarah Hoffmann [Thu, 13 May 2021 10:19:20 +0000 (12:19 +0200)]
ignore invalid coordinates in external postcodes
Sarah Hoffmann [Thu, 13 May 2021 10:07:20 +0000 (12:07 +0200)]
ignore entries without country code
Sarah Hoffmann [Thu, 13 May 2021 10:04:47 +0000 (12:04 +0200)]
add documentation for external postcode feature
Sarah Hoffmann [Thu, 13 May 2021 07:59:34 +0000 (09:59 +0200)]
correctly handle removing all postcodes for country
Sarah Hoffmann [Wed, 12 May 2021 22:14:52 +0000 (00:14 +0200)]
index postcodes after refreshing
Sarah Hoffmann [Wed, 12 May 2021 21:30:45 +0000 (23:30 +0200)]
add and extend tests for new postcode handling
Sarah Hoffmann [Wed, 12 May 2021 17:57:48 +0000 (19:57 +0200)]
move filling of postcode table to python
The Python code now takes care of reading postcodes from placex,
enhancing them with potentially existing external postcodes and
updating location_postcodes accordingly. The initial setup and
updates use exactly the same function.
External postcode handling has been generalized. External postcodes
for any country are now accepted. The format of the external postcode
file has changed. We now expect CSV, potentially gzipped. The
postcodes are no longer saved in the database.
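
Reading such a file could look roughly like the sketch below. The column names postcode, lat and lon and the function name read_external_postcodes are assumptions for illustration; the commit message only states that the file is CSV and may be gzipped. Skipping rows with unusable coordinates is likewise a choice made for this example.

```python
import csv
import gzip


def read_external_postcodes(path):
    """Yield (postcode, lat, lon) from a CSV file that may be gzipped.

    Assumes a header row with the columns 'postcode', 'lat' and 'lon'.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(str(path), "rt", encoding="utf-8", newline="") as fd:
        for row in csv.DictReader(fd):
            try:
                postcode = row["postcode"]
                lat, lon = float(row["lat"]), float(row["lon"])
            except (KeyError, TypeError, ValueError):
                continue  # missing column or non-numeric coordinate
            if not (-90 <= lat <= 90 and -180 <= lon <= 180):
                continue  # coordinate outside the valid range
            yield postcode, lat, lon
```

Because the function is a generator, a large postcode file never has to be held in memory as a whole.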
Sarah Hoffmann [Wed, 12 May 2021 18:25:22 +0000 (20:25 +0200)]
Merge pull request #2322 from mtmail/type-label-already-lowercased
typelabel value is already lowercased
marc tobias [Wed, 12 May 2021 17:16:51 +0000 (19:16 +0200)]
typelabel value is already lowercased
AntoJvlt [Mon, 10 May 2021 21:09:00 +0000 (23:09 +0200)]
Introduction of SPCsvLoader to load special phrases from a csv file
AntoJvlt [Mon, 10 May 2021 19:48:11 +0000 (21:48 +0200)]
Refactoring loading of external special phrases and importation process by introducing SPLoader and SPWikiLoader
Sarah Hoffmann [Thu, 6 May 2021 15:41:53 +0000 (17:41 +0200)]
Merge pull request #2314 from lonvia/fix-status-no-import-date
Correctly catch the exception when import date is missing
Sarah Hoffmann [Thu, 6 May 2021 15:22:04 +0000 (17:22 +0200)]
Merge pull request #2312 from lonvia/icu-tokenizer
Add new tokenizer based on libICU
Sarah Hoffmann [Thu, 6 May 2021 13:36:54 +0000 (15:36 +0200)]
correctly catch the exception when import date is missing
Sarah Hoffmann [Wed, 5 May 2021 19:16:55 +0000 (21:16 +0200)]
add missing transliterations
The ICU library only offers transliterations for a limited set of
scripts. Add transliterations for missing scripts from the PostgreSQL
module. This means that the same selection of scripts is supported
as with the old module.
Sarah Hoffmann [Wed, 5 May 2021 15:09:38 +0000 (17:09 +0200)]
fix name of transliterator
Should be different from the normalisation rules.
Sarah Hoffmann [Wed, 5 May 2021 08:00:34 +0000 (10:00 +0200)]
enable BDD tests for different tokenizers
The tokenizer to be used can be chosen with -DTOKENIZER.
Adapt all tests, so that they work with legacy_icu tokenizer.
Move lookup in word table to a function in the tokenizer.
Special phrases are temporarily imported from the wiki until
we have an implementation that can import from file. TIGER
tests do not work yet.
Sarah Hoffmann [Tue, 4 May 2021 16:32:57 +0000 (18:32 +0200)]
add unit tests for legacy ICU tokenizer
Sarah Hoffmann [Sun, 2 May 2021 20:13:18 +0000 (22:13 +0200)]
cache transliteration results
Sarah Hoffmann [Sun, 2 May 2021 19:21:41 +0000 (21:21 +0200)]
add PHP part for new ICU-based tokenizer
Sarah Hoffmann [Sun, 2 May 2021 15:52:45 +0000 (17:52 +0200)]
add Python part for new ICU-based tokenizer
Sarah Hoffmann [Tue, 4 May 2021 10:45:26 +0000 (12:45 +0200)]
Merge pull request #2310 from RhinoDevel/master
2nd try: Add hint about replication update & recheck intervals being in seconds.
Marc [Tue, 4 May 2021 09:47:15 +0000 (11:47 +0200)]
Add hint about replication update & recheck intervals being in seconds.
Sarah Hoffmann [Mon, 3 May 2021 07:15:34 +0000 (09:15 +0200)]
Merge pull request #2305 from lonvia/tokenizer
Factor out normalization into a separate module
Sarah Hoffmann [Sat, 1 May 2021 08:50:39 +0000 (10:50 +0200)]
mock tokenizer factory for replication tests
Sarah Hoffmann [Sat, 1 May 2021 08:28:49 +0000 (10:28 +0200)]
commit between migrations
Later migrations may require tables set up by older ones.
Sarah Hoffmann [Sat, 1 May 2021 08:03:00 +0000 (10:03 +0200)]
increase database version for tokenizer migration
Sarah Hoffmann [Fri, 30 Apr 2021 15:59:50 +0000 (17:59 +0200)]
fix linting issues
Sarah Hoffmann [Fri, 30 Apr 2021 15:28:34 +0000 (17:28 +0200)]
move index creation for word table to tokenizer
This introduces a finalization routine for the tokenizer
where it can post-process the import if necessary.
Sarah Hoffmann [Fri, 30 Apr 2021 14:17:28 +0000 (16:17 +0200)]
indexer: fetch extra place data asynchronously
The indexer now fetches any extra data besides the place_id
asynchronously while processing the places from the last batch.
This also means that more places are now fetched at once.
Sarah Hoffmann [Thu, 29 Apr 2021 20:16:31 +0000 (22:16 +0200)]
fetch place info asynchronously
Sarah Hoffmann [Thu, 29 Apr 2021 19:57:43 +0000 (21:57 +0200)]
indexer: fetch ids in batches
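
Fetching ids in batches can be sketched with the standard DB-API fetchmany() call. This is only an illustration of the general technique, not the indexer's code; the function name iter_id_batches is made up for the example, and it assumes the id is the first column of each row.

```python
def iter_id_batches(cursor, batch_size):
    """Yield lists of ids from an already executed DB-API cursor.

    Assumes the id is the first column of every result row.
    """
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            return
        yield [row[0] for row in rows]
```

Each batch can then be handed to a worker while the next one is being fetched.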
Sarah Hoffmann [Wed, 28 Apr 2021 19:15:18 +0000 (21:15 +0200)]
move database check for module to tokenizer
Sarah Hoffmann [Wed, 28 Apr 2021 18:13:51 +0000 (20:13 +0200)]
move status test to tokenizer
The availability of the module is now tested by the tokenizer.
Sarah Hoffmann [Wed, 28 Apr 2021 15:39:03 +0000 (17:39 +0200)]
add more tests for legacy tokenizer
Sarah Hoffmann [Wed, 28 Apr 2021 12:08:24 +0000 (14:08 +0200)]
move tokenization in query into tokenizer
Sarah Hoffmann [Wed, 28 Apr 2021 08:59:07 +0000 (10:59 +0200)]
boilerplate for PHP code of tokenizer
This adds an installation step for PHP code for the tokenizer. The
PHP code is split in two parts. The updateable code is found in
lib-php. The tokenizer installs an additional script in the
project directory which then includes the code from lib-php and
defines all settings that are static to the database. The website
code then always includes the PHP from the project directory.
Sarah Hoffmann [Wed, 28 Apr 2021 07:14:32 +0000 (09:14 +0200)]
tests for legacy tokenizer
Sarah Hoffmann [Tue, 27 Apr 2021 19:50:35 +0000 (21:50 +0200)]
move amenity creation to tokenizer
The BDD tests still use the old-style amenity creation scripts
because we don't have simple means to import a hand-crafted
test file of special phrases right now.
Sarah Hoffmann [Tue, 27 Apr 2021 09:37:18 +0000 (11:37 +0200)]
move default country name creation to tokenizer
The new function is also used when a country is updated. All SQL
functions related to country names have been removed.
Sarah Hoffmann [Mon, 26 Apr 2021 15:30:10 +0000 (17:30 +0200)]
cache all postcodes
Sarah Hoffmann [Mon, 26 Apr 2021 14:50:28 +0000 (16:50 +0200)]
reorganise address iteration in tokenizer
Sarah Hoffmann [Sun, 25 Apr 2021 21:43:57 +0000 (23:43 +0200)]
remove debug code
Sarah Hoffmann [Sun, 25 Apr 2021 21:42:56 +0000 (23:42 +0200)]
use address tokens in SQL
Sarah Hoffmann [Sun, 25 Apr 2021 20:04:07 +0000 (22:04 +0200)]
extract address tokens in tokenizer
Sarah Hoffmann [Sun, 25 Apr 2021 16:26:36 +0000 (18:26 +0200)]
move postcode normalization into tokenizer
Sarah Hoffmann [Sun, 25 Apr 2021 09:47:29 +0000 (11:47 +0200)]
move housenumber handling to tokenizer
Normalization and token computation are now done in the tokenizer.
The tokenizer keeps a cache of the hundred most used house numbers
to keep the number of calls to the database low.
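
A bounded cache in front of a database lookup can be sketched as below. Note the swapped-in technique: functools.lru_cache evicts the least recently used entry, which only approximates "most used" and is not necessarily how the tokenizer does it. The name make_cached_lookup and the fetch_token callable are hypothetical.

```python
from functools import lru_cache


def make_cached_lookup(fetch_token, maxsize=100):
    """Wrap `fetch_token` (a function that queries the database for the
    token of one house number) in a cache holding at most `maxsize` entries.
    """
    return lru_cache(maxsize=maxsize)(fetch_token)
```

Repeated lookups of a common house number are then served from memory, and only the first one reaches the database.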
Sarah Hoffmann [Sun, 25 Apr 2021 08:38:29 +0000 (10:38 +0200)]
move name token creation into tokenizer
Name tokens are now handed in via token_info and used from there.
Also moves the generic search name insertion function back to
placex_triggers.sql.
Sarah Hoffmann [Sat, 24 Apr 2021 20:35:46 +0000 (22:35 +0200)]
introduce name analyzer
The name analyzer is the actual work horse of the tokenizer. It
is instantiated once per thread and provides all functions for
analysing names and queries.
Sarah Hoffmann [Sat, 24 Apr 2021 09:25:47 +0000 (11:25 +0200)]
require tokenizer for indexer
Sarah Hoffmann [Fri, 23 Apr 2021 15:02:47 +0000 (17:02 +0200)]
introduce index for finding surrounding buildings
Sarah Hoffmann [Fri, 23 Apr 2021 14:15:00 +0000 (16:15 +0200)]
add extra column for tokenizer
Add a jsonb column to the placex and location_property_osmline tables
which can be used by the installed tokenizer as required. No other
part of the software will use or otherwise rely on this column.
Sarah Hoffmann [Fri, 23 Apr 2021 13:49:38 +0000 (15:49 +0200)]
introduce external processing in indexer
Indexing is now split into three parts: first a preparation step
that collects the necessary information from the database and
returns it to Python. In a second step the data is transformed
within Python as necessary and then returned to the database
through the usual UPDATE which now not only sets the indexed_status
but also other fields. The third step comprises the address
computation which is still done inside the update trigger in
the database.
The second processing step doesn't do anything useful yet.
Sarah Hoffmann [Thu, 22 Apr 2021 20:47:34 +0000 (22:47 +0200)]
move word table and normalisation SQL into tokenizer
Creating and populating the word table is now the responsibility
of the tokenizer.
The get_maxwordfreq() function has been replaced with a
simple template parameter to the SQL during function installation.
The number is taken from the parameter list in the database to
ensure that it is not changed after installation.
Sarah Hoffmann [Wed, 21 Apr 2021 13:38:52 +0000 (15:38 +0200)]
add migration for configurable tokenizer
Adds a migration that initialises a legacy tokenizer for
an existing database. The migration is not active yet as
it will need completion when more functionality is added
to the legacy tokenizer.
Sarah Hoffmann [Wed, 21 Apr 2021 13:00:37 +0000 (15:00 +0200)]
move module installation to legacy tokenizer
Sarah Hoffmann [Wed, 21 Apr 2021 07:57:17 +0000 (09:57 +0200)]
introduce tokenizer modules
This adds the boilerplate for selecting configurable tokenizers.
A tokenizer can be chosen at import time and will then install
itself such that it is fixed for the given database import even
when the software itself is updated.
The legacy tokenizer implements Nominatim's traditional algorithms.
Sarah Hoffmann [Fri, 30 Apr 2021 09:19:35 +0000 (11:19 +0200)]
Merge pull request #2303 from lonvia/remove-aux-support
Remove support for AUX housenumber tables
Sarah Hoffmann [Fri, 30 Apr 2021 08:08:29 +0000 (10:08 +0200)]
remove support for AUX housenumber tables
These tables have never been actively maintained and the code is
completely untested. With the upcoming changes, it is unlikely
that the code remains usable.
This removes the aux tables and all code that references them.
Sarah Hoffmann [Tue, 27 Apr 2021 10:18:45 +0000 (12:18 +0200)]
Merge pull request #2299 from lonvia/update-actions
Fix database check for reverse-only
Sarah Hoffmann [Tue, 27 Apr 2021 09:57:05 +0000 (11:57 +0200)]
Merge pull request #2291 from AntoJvlt/special-phrases-statistics
Special phrases statistics
Sarah Hoffmann [Tue, 27 Apr 2021 08:14:26 +0000 (10:14 +0200)]
do not check for extra housenumber index for reverse-only
Also adds a database check for reverse only import to the CI.
Sarah Hoffmann [Mon, 26 Apr 2021 21:01:06 +0000 (23:01 +0200)]
add tests for different scripts
Sarah Hoffmann [Mon, 26 Apr 2021 09:21:44 +0000 (11:21 +0200)]
Merge pull request #2298 from lonvia/add-warming-to-ci
Add warming to CI import tests and fix more Python 3.5 compatibility issues
Sarah Hoffmann [Mon, 26 Apr 2021 08:16:05 +0000 (10:16 +0200)]
avoid Path in subprocess parameters
Not supported by Python 3.5.
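
The usual workaround is to convert Path objects to strings before they reach subprocess. A minimal sketch, with run_script as a made-up helper name:

```python
import subprocess
import sys
from pathlib import Path


def run_script(script, *args):
    """Run a Python script and return its captured stdout as bytes.

    Every argument is converted with str() so that pathlib.Path
    objects also work on interpreters that only accept strings.
    """
    cmd = [sys.executable, str(script)] + [str(arg) for arg in args]
    return subprocess.run(cmd, check=True, stdout=subprocess.PIPE).stdout
```

Converting at the call site keeps Path objects usable everywhere else in the code.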
Sarah Hoffmann [Mon, 26 Apr 2021 07:54:09 +0000 (09:54 +0200)]
add warming to CI import test
AntoJvlt [Sun, 25 Apr 2021 15:56:12 +0000 (17:56 +0200)]
Switching to log info and only send warning for invalid phrases
AntoJvlt [Thu, 22 Apr 2021 15:34:35 +0000 (17:34 +0200)]
Implemented statistics for the import of special phrases through the SpecialPhrasesImporterStatistics class
AntoJvlt [Wed, 21 Apr 2021 15:11:57 +0000 (17:11 +0200)]
reorganization of folder/file for the special phrases importer
Sarah Hoffmann [Sat, 24 Apr 2021 13:35:00 +0000 (15:35 +0200)]
Merge pull request #2297 from lonvia/update-deployment-docs
docs: update deployment to use project directory
Sarah Hoffmann [Sat, 24 Apr 2021 13:03:28 +0000 (15:03 +0200)]
Merge pull request #2296 from lonvia/disable-too-few-public-methods-check
pylint: disable too-few-public-methods check
Sarah Hoffmann [Sat, 24 Apr 2021 13:00:18 +0000 (15:00 +0200)]
docs: update deployment to use project directory
Fixes #2295.
Sarah Hoffmann [Sat, 24 Apr 2021 09:44:36 +0000 (11:44 +0200)]
fix pylint complaints