docs/admin/Tokenizers.md

# Tokenizers

The tokenizer module in Nominatim is responsible for analysing the names given
to OSM objects and the terms of an incoming query to make sure that they
can be matched appropriately.

Nominatim offers different tokenizer modules, which behave differently and have
different configuration options. This section describes the tokenizers and how
they can be configured.

!!! important
    The use of a tokenizer is tied to a database installation. You need to choose
    and configure the tokenizer before starting the initial import. Once the import
    is done, you cannot switch to another tokenizer. Reconfiguring the chosen
    tokenizer is very limited as well. See the comments in each tokenizer
    section.

## Legacy tokenizer

The legacy tokenizer implements the analysis algorithms of older Nominatim
versions. It uses a special PostgreSQL module to normalize names and queries.
This tokenizer is currently the default.

To enable the tokenizer, add the following line to your project configuration:

```
NOMINATIM_TOKENIZER=legacy
```

The PostgreSQL module for the tokenizer is available in the `module` directory
and is also installed with the remainder of the software under
`lib/nominatim/module/nominatim.so`. You can specify a custom location for
the module with

```
NOMINATIM_DATABASE_MODULE_PATH=<path to directory where nominatim.so resides>
```

This is particularly useful when the database runs on a different server.
See [Advanced installations](Advanced-Installations.md#importing-nominatim-to-an-external-postgresql-database) for details.

There are no other configuration options for the legacy tokenizer. All
normalization functions are hard-coded.

## ICU tokenizer

The ICU tokenizer uses the [ICU library](http://site.icu-project.org/) to
normalize names and queries. It also offers configurable decomposition and
abbreviation handling.

### How it works

On import, the tokenizer processes names in the following four stages:

1. The **Normalization** part removes all non-relevant information from the
   input.
2. Incoming names are now converted to **full names**. This process is currently
   hard-coded and mostly serves to handle name tags from OSM that contain
   multiple names (e.g. [Biel/Bienne](https://www.openstreetmap.org/node/240097197)).
3. Next, the tokenizer creates **variants** from the full names. These variants
   cover decomposition and abbreviation handling. Variants are saved to the
   database, so that it is not necessary to create the variants for a search
   query.
4. The final **Tokenization** step converts the names to a simple ASCII form,
   potentially removing further spelling variants for better matching.

At query time, only stages 1 and 4 are used. The query is normalized and
tokenized, and the resulting string is used for searching in the database.

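The interplay of the four stages can be sketched in a few lines of Python. This is a toy illustration only, not Nominatim's implementation: all function names are hypothetical, plain string operations stand in for the ICU rules, splitting on `/` is assumed purely to mirror the Biel/Bienne example, and a single `strasse` abbreviation is hard-wired as the only variant rule.

```python
import unicodedata


def normalize(text: str) -> str:
    """Stage 1 (stand-in): lowercase, expand 'ß' and collapse whitespace."""
    return " ".join(text.lower().replace("ß", "ss").split())


def full_names(name: str) -> list[str]:
    """Stage 2 (stand-in): split a tag that holds several names."""
    return [part.strip() for part in name.split("/") if part.strip()]


def variants(name: str) -> set[str]:
    """Stage 3 (stand-in): keep the name and add one abbreviated variant."""
    result = {name}
    if name.endswith("strasse"):
        result.add(name[: -len("strasse")] + "str")
    return result


def tokenize(name: str) -> str:
    """Stage 4 (stand-in): drop diacritics to obtain a plain ASCII form."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def import_tokens(raw_name: str) -> set[str]:
    """Run all four stages, as done for names at import time."""
    tokens = set()
    for name in full_names(normalize(raw_name)):
        for variant in variants(name):
            tokens.add(tokenize(variant))
    return tokens


def query_token(raw_query: str) -> str:
    """Run only stages 1 and 4, as done for queries."""
    return tokenize(normalize(raw_query))


stored = import_tokens("Hauptstraße")
print(query_token("Hauptstr") in stored)  # the abbreviated query matches
```

In this sketch, a query for `Hauptstr` finds the name `Hauptstraße` because the abbreviated variant was stored at import time, while the query itself only passes through stages 1 and 4.
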
### Configuration

The ICU tokenizer is configured using a YAML file whose location is set with
`NOMINATIM_TOKENIZER_CONFIG`. The configuration is read on import and then
saved as part of the internal database status. Later changes to the variable
have no effect.

Here is an example configuration file:

``` yaml
normalization:
    - ":: lower ()"
    - "ß > 'ss'" # German eszett is unambiguously equal to double s
transliteration:
    - !include /etc/nominatim/icu-rules/extended-unicode-to-asccii.yaml
    - ":: Ascii ()"
variants:
    - language: de
      words:
        - ~haus => haus
        - ~strasse -> str
    - language: en
      words:
        - road -> rd
        - bridge -> bdge,br,brdg,bri,brg
```

The configuration file contains three sections:
`normalization`, `transliteration`, `variants`.

The normalization and transliteration sections must each contain a list of
[ICU transformation rules](https://unicode-org.github.io/icu/userguide/transforms/general/rules.html).
The rules are applied in the order in which they appear in the file.
You can also include additional rules from an external YAML file using the
`!include` tag. The included file must contain a valid YAML list of ICU rules
and may again include other files.

!!! warning
    The ICU rule syntax contains special characters that conflict with the
    YAML syntax. You should therefore always enclose the ICU rules in
    double quotes.

The variants section defines lists of replacements which create alternative
spellings of a name. To create the variants, a name is scanned from left to
right and the longest matching replacement is applied until the end of the
string is reached.

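The scanning strategy can be illustrated with a small Python sketch. It assumes a simplified, hypothetical rule table that maps each source phrase to a single replacement and matches whole words only. The rules shown are invented for the example, and the function is not part of Nominatim.

```python
def apply_longest_match(name: str, rules: dict[str, str]) -> str:
    """Scan the name left to right, always applying the longest matching rule."""
    words = name.split()
    # Longest source phrase, counted in words, that any rule can match.
    max_len = max((len(source.split()) for source in rules), default=1)
    out = []
    i = 0
    while i < len(words):
        # Try the longest possible phrase first, then shorter ones.
        for span in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + span])
            if candidate in rules:
                out.append(rules[candidate])
                i += span
                break
        else:
            # No rule matched at this position: keep the word unchanged.
            out.append(words[i])
            i += 1
    return " ".join(out)


# Hypothetical rules, chosen so that two sources overlap.
rules = {"north": "n", "north east": "ne", "road": "rd"}
print(apply_longest_match("north east road", rules))  # ne rd
```

Because `north east` is longer than `north`, it wins at the first position, and scanning continues after the matched phrase.
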
The variants section must contain a list of replacement groups. Each group
defines a set of properties that describes where the replacements are
applicable. In addition, the `words` section defines the list of replacements
to be made. The basic replacement description is of the form:

```
<source>[,<source>[...]] => <target>[,<target>[...]]
```

The left side contains one or more `source` terms to be replaced. The right side
lists one or more replacements. Each source is replaced with each replacement
term.

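As a rough sketch of how such a line expands, the following Python function (a hypothetical helper, not Nominatim's parser) splits a plain `=>` rule into its source and target terms and pairs every source with every target. It deliberately handles only this basic form.

```python
from itertools import product


def expand_rule(rule: str) -> list[tuple[str, str]]:
    """Pair every source term with every target term of a plain '=>' rule."""
    left, right = rule.split("=>")
    sources = [term.strip() for term in left.split(",")]
    targets = [term.strip() for term in right.split(",")]
    return list(product(sources, targets))


print(expand_rule("bridge => bdge,br"))
# [('bridge', 'bdge'), ('bridge', 'br')]
```
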
!!! tip
    The source and target terms are internally normalized using the
    normalization rules given in the configuration. This ensures that the
    strings match as expected. In fact, it is better to use unnormalized
    words in the configuration because then it is possible to change the
    rules for normalization later without having to adapt the variant rules.

#### Decomposition

In its standard form, a source matches only full words. There
is a special notation to match the prefix and suffix of a word:

``` yaml
- ~strasse => str  # matches "strasse" as full word and in suffix position
- hinter~ => hntr  # matches "hinter" as full word and in prefix position
```

There is no facility to match a string in the middle of the word. The suffix
and prefix notation automatically trigger the decomposition mode: two variants
are created for each replacement, one with the replacement attached to the word
and one separate. So in the above example, the tokenization of "hauptstrasse" will
create the variants "hauptstr" and "haupt str". Similarly, the name "rote strasse"
triggers the variants "rote str" and "rotestr". By having decomposition work
both ways, it is sufficient to create the variants at index time. The variant
rules are not applied at query time.

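The two-way behaviour of decomposition can be sketched in Python as follows. The function name and signature are hypothetical, only the suffix case is covered, and each matching word is treated on its own. It is meant to illustrate the idea rather than reproduce Nominatim's code.

```python
def decompose_suffix(name: str, source: str, target: str) -> set[str]:
    """Return the attached and the separate variant for a '~source => target' rule."""
    words = name.split()
    variants = set()
    for i, word in enumerate(words):
        if not word.endswith(source):
            continue
        stem = word[: len(word) - len(source)]
        before, after = words[:i], words[i + 1:]
        if stem:
            # Source is the suffix of a longer word, e.g. "hauptstrasse".
            variants.add(" ".join(before + [stem + target] + after))
            variants.add(" ".join(before + [stem, target] + after))
        else:
            # Source is a full word, e.g. the second word of "rote strasse".
            variants.add(" ".join(before + [target] + after))
            if before:
                # Also glue the replacement onto the preceding word.
                variants.add(" ".join(before[:-1] + [before[-1] + target] + after))
    return variants


print(sorted(decompose_suffix("hauptstrasse", "strasse", "str")))
# ['haupt str', 'hauptstr']
print(sorted(decompose_suffix("rote strasse", "strasse", "str")))
# ['rote str', 'rotestr']
```
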
To avoid automatic decomposition, use the '|' notation:

``` yaml
- ~strasse |=> str
```

This rule simply changes "hauptstrasse" to "hauptstr" and "rote strasse" to "rote str".

#### Initial and final terms

It is also possible to restrict replacements to the beginning and end of a
name:

``` yaml
- ^south => s  # matches only at the beginning of the name
- road$ => rd  # matches only at the end of the name
```

So the first example would trigger a replacement for "south 45th street" but
not for "the south beach restaurant".

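A minimal Python sketch of the anchoring behaviour is shown below. The helper is hypothetical, works on whole words only, and assumes a rule such as `^south => s` has already been split into its source and target.

```python
def apply_anchored(name: str, source: str, target: str) -> str:
    """Apply a '^source' or 'source$' rule to the first or last word only."""
    words = name.split()
    if not words:
        return name
    if source.startswith("^"):
        # Initial term: only the first word of the name may match.
        if words[0] == source[1:]:
            words[0] = target
    elif source.endswith("$"):
        # Final term: only the last word of the name may match.
        if words[-1] == source[:-1]:
            words[-1] = target
    return " ".join(words)


print(apply_anchored("south 45th street", "^south", "s"))
# s 45th street
print(apply_anchored("the south beach restaurant", "^south", "s"))
# the south beach restaurant
```
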
#### Replacements vs. variants

The replacement syntax `source => target` works as a pure replacement. It changes
the name instead of creating a variant. To create an additional version, you'd
have to write `source => source,target`. As this is a frequent case, there is
a shortcut notation for it:

```
<source>[,<source>[...]] -> <target>[,<target>[...]]
```

The simple arrow causes an additional variant to be added. Note that
decomposition has an effect here on the source as well. So a rule

```yaml
- ~strasse -> str
```

means that for a word like `hauptstrasse` four variants are created:
`hauptstrasse`, `haupt strasse`, `hauptstr` and `haupt str`.

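To make the count of four explicit, here is a short Python sketch for a single word and a suffix rule written with the simple arrow. The function is a hypothetical illustration: it treats the source as one more replacement and emits the attached and the separate form for each.

```python
def suffix_variants_keep_source(word: str, source: str, target: str) -> set[str]:
    """Variants of one word for a '~source -> target' rule (source is kept)."""
    if word == source:
        # Full-word match: nothing to decompose, keep source and add target.
        return {source, target}
    if not word.endswith(source):
        return {word}
    stem = word[: len(word) - len(source)]
    variants = set()
    # The simple arrow keeps the source as an additional replacement.
    for replacement in (source, target):
        variants.add(stem + replacement)      # attached form
        variants.add(f"{stem} {replacement}")  # separate form
    return variants


print(sorted(suffix_variants_keep_source("hauptstrasse", "strasse", "str")))
# ['haupt str', 'haupt strasse', 'hauptstr', 'hauptstrasse']
```
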
### Reconfiguration

Changing the configuration after the import is currently not possible, although
this feature may be added at a later time.