add documentation for ICU tokenizer configuration

author Sarah Hoffmann <lonvia@denofr.de>

Sat, 26 Jun 2021 08:13:33 +0000 (10:13 +0200)

committer Sarah Hoffmann <lonvia@denofr.de>

Sun, 4 Jul 2021 08:28:20 +0000 (10:28 +0200)
diff --git a/docs/admin/Tokenizers.md b/docs/admin/Tokenizers.md

new file mode 100644 (file)

index 0000000..a4d6aa0
--- /dev/null
+++ b/docs/admin/Tokenizers.md
@@ -0,0 +1,201 @@
+# Tokenizers
+
+The tokenizer module in Nominatim is responsible for analysing the names given
+to OSM objects and the terms of an incoming query in order to make sure they
+can be matched appropriately.
+
+Nominatim offers different tokenizer modules, which behave differently and have
+different configuration options. This section describes the tokenizers and how
+they can be configured.
+
+!!! important
+    The use of a tokenizer is tied to a database installation. You need to choose
+    and configure the tokenizer before starting the initial import. Once the import
+    is done, you cannot switch to another tokenizer anymore. Reconfiguring the
+    chosen tokenizer is very limited as well. See the comments in each tokenizer
+    section.
+
+## Legacy tokenizer
+
+The legacy tokenizer implements the analysis algorithms of older Nominatim
+versions. It uses a special PostgreSQL module to normalize names and queries.
+This tokenizer is currently the default.
+
+To enable the tokenizer add the following line to your project configuration:
+
+```
+NOMINATIM_TOKENIZER=legacy
+```
+
+The PostgreSQL module for the tokenizer is available in the `module` directory
+and also installed with the remainder of the software under
+`lib/nominatim/module/nominatim.so`. You can specify a custom location for
+the module with
+
+```
+NOMINATIM_DATABASE_MODULE_PATH=<path to directory where nominatim.so resides>
+```
+
+This is particularly useful when the database runs on a different server.
+See [Advanced installations](Advanced-Installations.md#importing-nominatim-to-an-external-postgresql-database) for details.
+
+There are no other configuration options for the legacy tokenizer. All
+normalization functions are hard-coded.
+
+## ICU tokenizer
+
+The ICU tokenizer uses the [ICU library](http://site.icu-project.org/) to
+normalize names and queries. It also offers configurable decomposition and
+abbreviation handling.
+
+### How it works
+
+On import the tokenizer processes names in the following four stages:
+
+1. The **Normalization** part removes all non-relevant information from the
+   input.
+2. Incoming names are now converted to **full names**. This process is currently
+   hard coded and mostly serves to handle name tags from OSM that contain
+   multiple names (e.g. [Biel/Bienne](https://www.openstreetmap.org/node/240097197)).
+3. Next the tokenizer creates **variants** from the full names. These variants
+   cover decomposition and abbreviation handling. Variants are saved to the
+   database, so that it is not necessary to create the variants for a search
+   query.
+4. The final **Tokenization** step converts the names to a simple ASCII form,
+   potentially removing further spelling variants for better matching.
+
+At query time only stages 1 and 4 are used. The query is normalized and
+tokenized, and the resulting string is used for searching in the database.
+
+### Configuration
+
+The ICU tokenizer is configured using a YAML file whose location is set with
+`NOMINATIM_TOKENIZER_CONFIG`. The configuration is read on import and then
+saved as part of the internal database status. Later changes to the variable
+have no effect.
+
+Here is an example configuration file:
+
+``` yaml
+normalization:
+    - ":: lower ()"
+    - "ß > 'ss'" # German eszett is unambiguously equal to double s
+transliteration:
+    - !include /etc/nominatim/icu-rules/extended-unicode-to-asccii.yaml
+    - ":: Ascii ()"
+variants:
+    - language: de
+      words:
+        - ~haus => haus
+        - ~strasse -> str
+    - language: en
+      words: 
+        - road -> rd
+        - bridge -> bdge,br,brdg,bri,brg
+```
+
+The configuration file contains three sections:
+`normalization`, `transliteration`, `variants`.
+
+The normalization and transliteration sections each must contain a list of
+[ICU transformation rules](https://unicode-org.github.io/icu/userguide/transforms/general/rules.html).
+The rules are applied in the order in which they appear in the file.
+You can also include additional rules from an external YAML file using the
+`!include` tag. The included file must contain a valid YAML list of ICU rules
+and may again include other files.
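+
+As a minimal sketch of such an included file (the rules in it are made up for
+illustration and are not taken from a file shipped with Nominatim), a rule
+list could look like this:
+
+``` yaml
+# Hypothetical include file: a plain YAML list of ICU rules, one rule per item.
+- "ä > 'ae'"
+- "ö > 'oe'"
+- "ü > 'ue'"
+```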
+
+!!! warning
+    The ICU rule syntax contains special characters that conflict with the
+    YAML syntax. You should therefore always enclose the ICU rules in
+    double-quotes.
+
+The variants section defines lists of replacements which create alternative
+spellings of a name. To create the variants, a name is scanned from left to
+right and the longest matching replacement is applied until the end of the
+string is reached.
+
+The variants section must contain a list of replacement groups. Each group
+defines a set of properties that describes where the replacements are
+applicable. In addition, the `words` section defines the list of replacements
+to be made. The basic replacement description is of the form:
+
+```
+<source>[,<source>[...]] => <target>[,<target>[...]]
+```
+
+The left side contains one or more `source` terms to be replaced. The right side
+lists one or more replacements. Each source is replaced with each replacement
+term.
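+
+As an illustration of this form (the terms are chosen for this example only
+and are not taken from a shipped configuration), rules with several sources
+or several targets could look like this:
+
+``` yaml
+- avenue,av => ave     # each of the two sources is replaced with "ave"
+- bridge => bdge,brg   # "bridge" is replaced with "bdge" and with "brg"
+```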
+
+!!! tip
+    The source and target terms are internally normalized using the
+    normalization rules given in the configuration. This ensures that the
+    strings match as expected. In fact, it is better to use unnormalized
+    words in the configuration because then it is possible to change the
+    rules for normalization later without having to adapt the variant rules.
+
+#### Decomposition
+
+In its standard form, only full words match against the source. There
+is a special notation to match the prefix and suffix of a word:
+
+``` yaml
+- ~strasse => str  # matches "strasse" as full word and in suffix position
+- hinter~ => hntr  # matches "hinter" as full word and in prefix position
+```
+
+There is no facility to match a string in the middle of the word. The suffix
+and prefix notation automatically trigger the decomposition mode: two variants
+are created for each replacement, one with the replacement attached to the word
+and one separate. So in the above example, the tokenization of "hauptstrasse" will
+create the variants "hauptstr" and "haupt str". Similarly, the name "rote strasse"
+triggers the variants "rote str" and "rotestr". By having decomposition work
+both ways, it is sufficient to create the variants at index time. The variant
+rules are not applied at query time.
+
+To avoid automatic decomposition, use the `|` notation:
+
+``` yaml
+- ~strasse |=> str
+```
+
+This rule simply changes "hauptstrasse" to "hauptstr" and "rote strasse" to "rote str".
+
+#### Initial and final terms
+
+It is also possible to restrict replacements to the beginning and end of a
+name:
+
+``` yaml
+- ^south => s  # matches only at the beginning of the name
+- road$ => rd  # matches only at the end of the name
+```
+
+So the first example would trigger a replacement for "south 45th street" but
+not for "the south beach restaurant".
+
+#### Replacements vs. variants
+
+The replacement syntax `source => target` works as a pure replacement. It changes
+the name instead of creating a variant. To create an additional version, you'd
+have to write `source => source,target`. As this is a frequent case, there is
+a shortcut notation for it:
+
+```
+<source>[,<source>[...]] -> <target>[,<target>[...]]
+```
+
+The simple arrow causes an additional variant to be added. Note that
+decomposition has an effect here on the source as well. So a rule
+
+```yaml
+- ~strasse -> str
+```
+
+means that for a word like `hauptstrasse` four variants are created:
+`hauptstrasse`, `haupt strasse`, `hauptstr` and `haupt str`.
+
+### Reconfiguration
+
+Changing the configuration after the import is currently not possible, although
+this feature may be added at a later time.
diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml

index ef2ef9a5cbcf6c16c481a98504009c9a5c6d6098..5c6147aa6cafae3892fcad460453e83563e78bf6 100644 (file)
--- a/docs/mkdocs.yml
+++ b/docs/mkdocs.yml
@@ -20,6 +20,7 @@ pages:
          - 'Update' : 'admin/Update.md'
          - 'Deploy' : 'admin/Deployment.md'
          - 'Customize Imports' : 'admin/Customization.md'
+        - 'Tokenizers' : 'admin/Tokenizers.md'
          - 'Nominatim UI'  : 'admin/Setup-Nominatim-UI.md'
          - 'Advanced Installations' : 'admin/Advanced-Installations.md'
          - 'Migration from older Versions' : 'admin/Migration.md'