git.openstreetmap.org Git - nominatim.git/commit
introduce sanitizer step before token analysis
author Sarah Hoffmann <lonvia@denofr.de>
Thu, 30 Sep 2021 19:30:13 +0000 (21:30 +0200)
committer Sarah Hoffmann <lonvia@denofr.de>
Fri, 1 Oct 2021 10:27:24 +0000 (12:27 +0200)
commit 8171fe4571a57bf8e5b2a8f676989e973897e2e7
tree 528f1250abb2bdcfbbc262cd1041fa3a78290d09
parent 16daa57e4757e4daeffec1e61630f989727dc563

Sanitizer functions allow name and address tags to be transformed
before they are handed to the tokenizer. These transformations are
visible only to the tokenizer and thus only influence the search
terms and address match terms for a place.

Currently two sanitizers are implemented, which are responsible for
splitting names with multiple values and removing bracket additions.
Both were previously hard-coded in the tokenizer.
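
The idea of a sanitizer chain described above can be illustrated with a minimal Python sketch. It is not the code from this commit: the `Name` class, the `sanitize` helper, the function signatures and the default delimiters are all hypothetical, chosen only to show how a name-splitting step and a bracket-stripping step might run in sequence before tokenization.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Name:
    """Hypothetical container for one name tag of a place."""
    kind: str   # tag key, e.g. "name" or "alt_name"
    value: str  # tag value as found in the source data


# A sanitizer takes the current list of names and returns a new list.
Sanitizer = Callable[[List[Name]], List[Name]]


def split_name_list(names: List[Name], delimiters: str = ";,") -> List[Name]:
    """Split values that contain several names into one entry per name.

    The default delimiters are an assumption made for this sketch.
    """
    pattern = re.compile("[" + re.escape(delimiters) + "]")
    result = []
    for name in names:
        for part in pattern.split(name.value):
            part = part.strip()
            if part:  # skip empty fragments such as a trailing ";"
                result.append(Name(name.kind, part))
    return result


def strip_brace_terms(names: List[Name]) -> List[Name]:
    """Remove additions in round brackets, e.g. "Halle (Saale)" -> "Halle"."""
    result = []
    for name in names:
        stripped = re.sub(r"\s*\([^)]*\)", "", name.value).strip()
        if stripped:  # drop names that consisted only of a bracket term
            result.append(Name(name.kind, stripped))
    return result


def sanitize(names: List[Name], steps: List[Sanitizer]) -> List[Name]:
    """Run all sanitizer steps in order; the input list is left untouched."""
    for step in steps:
        names = step(names)
    return names


raw = [Name("name", "Halle (Saale);Halle an der Saale")]
cleaned = sanitize(raw, [split_name_list, strip_brace_terms])
print([n.value for n in cleaned])
```

Because each step returns a new list, the original tags stay unchanged, which mirrors the statement that the transformations are visible only to the tokenizer.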
nominatim/tokenizer/icu_rule_loader.py
nominatim/tokenizer/icu_tokenizer.py
nominatim/tokenizer/place_sanitizer.py [new file with mode: 0644]
nominatim/tokenizer/sanitizers/__init__.py [new file with mode: 0644]
nominatim/tokenizer/sanitizers/split_name_list.py [new file with mode: 0644]
nominatim/tokenizer/sanitizers/strip_brace_terms.py [new file with mode: 0644]
settings/icu_tokenizer.yaml
test/python/test_tokenizer_icu.py