add documentation for new configuration of ICU tokenizer

author Sarah Hoffmann <lonvia@denofr.de>

Thu, 7 Oct 2021 09:55:53 +0000 (11:55 +0200)

committer Sarah Hoffmann <lonvia@denofr.de>

Thu, 7 Oct 2021 09:55:53 +0000 (11:55 +0200)
diff --git a/docs/admin/Tokenizers.md b/docs/admin/Tokenizers.md

index 6f8898c8ee70690d88aabd63661b758c9ed37b38..90d0fb5e03332702a5c551f3fc69dab317a675c3 100644 (file)
--- a/docs/admin/Tokenizers.md
+++ b/docs/admin/Tokenizers.md
@@ -60,22 +60,23 @@ NOMINATIM_TOKENIZER=icu
  
  ### How it works
  
-On import the tokenizer processes names in the following four stages:
-
-1. The **Normalization** part removes all non-relevant information from the
-   input.
-2. Incoming names are now converted to **full names**. This process is currently
-   hard coded and mostly serves to handle name tags from OSM that contain
-   multiple names (e.g. [Biel/Bienne](https://www.openstreetmap.org/node/240097197)).
-3. Next the tokenizer creates **variants** from the full names. These variants
-   cover decomposition and abbreviation handling. Variants are saved to the
-   database, so that it is not necessary to create the variants for a search
-   query.
-4. The final **Tokenization** step converts the names to a simple ASCII form,
-   potentially removing further spelling variants for better matching.
-
-At query time only stage 1) and 4) are used. The query is normalized and
-tokenized and the resulting string used for searching in the database.
+On import the tokenizer processes names in the following three stages:
+
+1. During the **Sanitizer step** incoming names are cleaned up and converted to
+   **full names**. This step can be used to regularize spelling, split multi-name
+   tags into their parts and tag names with additional attributes. See the
+   [Sanitizers section](#sanitizers) below for available cleaning routines.
+2. The **Normalization** part removes all information from the full names
+   that is not relevant for search.
+3. The **Token analysis** step takes the normalized full names and creates
+   all transliterated variants under which the name should be searchable.
+   See the [Token analysis](#token-analysis) section below for more
+   information.
+
+During query time, only normalization and transliteration are relevant.
+An incoming query is first split into name chunks (this usually means splitting
+the string at the commas) and then each part is normalized and transliterated.
+The result is used to look up places in the search index.
  
  ### Configuration
  
@@ -93,21 +94,36 @@ normalization:
  transliteration:
      - !include /etc/nominatim/icu-rules/extended-unicode-to-asccii.yaml
      - ":: Ascii ()"
-variants:
-    - language: de
-      words:
-        - ~haus => haus
-        - ~strasse -> str
-    - language: en
-      words: 
-        - road -> rd
-        - bridge -> bdge,br,brdg,bri,brg
+sanitizers:
+    - step: split-name-list
+token-analysis:
+    - analyzer: generic
+      variants:
+          - !include icu-rules/variants-ca.yaml
+          - words:
+              - road -> rd
+              - bridge -> bdge,br,brdg,bri,brg
  ```
  
-The configuration file contains three sections:
-`normalization`, `transliteration`, `variants`.
+The configuration file contains four sections:
+`normalization`, `transliteration`, `sanitizers` and `token-analysis`.
  
-The normalization and transliteration sections each must contain a list of
+#### Normalization and Transliteration
+
+The normalization and transliteration sections each define a set of
+ICU rules that are applied to the names.
+
+The **normalization** rules are applied after sanitization. They should remove
+any information that is not relevant for search at all. Usual rules to be
+applied here are: lower-casing, removing of special characters, cleanup of
+spaces.
+
+The **transliteration** rules are applied at the end of the tokenization
+process to transfer the name into an ASCII representation. Transliteration can
+be useful to allow for further fuzzy matching, especially between different
+scripts.
+
+Each section must contain a list of
  [ICU transformation rules](https://unicode-org.github.io/icu/userguide/transforms/general/rules.html).
  The rules are applied in the order in which they appear in the file.
  You can also include additional rules from external yaml file using the
@@ -119,6 +135,85 @@ and may again include other files.
      YAML syntax. You should therefore always enclose the ICU rules in
      double-quotes.
  
+#### Sanitizers
+
+The sanitizers section defines an ordered list of functions that are applied
+to the name and address tags before they are further processed by the tokenizer.
+They allow you to clean up the tagging and bring it to a standardized form more
+suitable for building the search index.
+
+!!! hint
+    Sanitizers only have an effect on how the search index is built. They
+    do not change the information about each place that is saved in the
+    database. In particular, they have no influence on how the results are
+    displayed. The returned results always show the original information as
+    stored in the OpenStreetMap database.
+
+Each entry describes one sanitizer to be applied. It has a
+mandatory parameter `step` which gives the name of the sanitizer. Depending
+on the type, it may have additional parameters to configure its operation.
+
+The order of the list matters. The sanitizers are applied exactly in the order
+that is configured. Each sanitizer works on the results of the previous one.
+
+The following is a list of sanitizers that are shipped with Nominatim.
+
+##### split-name-list
+
+::: nominatim.tokenizer.sanitizers.split_name_list
+    selection:
+        members: False
+    rendering:
+        heading_level: 6
+
+##### strip-brace-terms
+
+::: nominatim.tokenizer.sanitizers.strip_brace_terms
+    selection:
+        members: False
+    rendering:
+        heading_level: 6
+
+##### tag-analyzer-by-language
+
+::: nominatim.tokenizer.sanitizers.tag_analyzer_by_language
+    selection:
+        members: False
+    rendering:
+        heading_level: 6
+
+
+
+#### Token Analysis
+
+Token analyzers take a full name and transform it into one or more normalized
+forms that are then saved in the search index. In its simplest form, the
+analyzer only applies the transliteration rules. More complex analyzers
+create additional spelling variants of a name. This is useful to handle
+decomposition and abbreviation.
+
+The ICU tokenizer may use different analyzers for different names. To select
+the analyzer to be used, the name must be tagged with the `analyzer` attribute
+by a sanitizer (see for example the
+[tag-analyzer-by-language sanitizer](#tag-analyzer-by-language)).
+
+The token-analysis section contains the list of configured analyzers. Each
+analyzer must have an `id` parameter that uniquely identifies the analyzer.
+The only exception is the default analyzer that is used when no special
+analyzer was selected.
+
+Different analyzer implementations may exist. To select the implementation,
+the `analyzer` parameter must be set. Currently there is only one implementation,
+`generic`, which is described below.
+
+##### Generic token analyzer
+
+The generic analyzer is able to create variants from a list of given
+abbreviation and decomposition replacements. It takes one optional parameter
+`variants` which lists the replacements to apply. If the section is
+omitted, then the generic analyzer becomes a simple analyzer that only
+applies the transliteration.
+
  The variants section defines lists of replacements which create alternative
  spellings of a name. To create the variants, a name is scanned from left to
  right and the longest matching replacement is applied until the end of the
@@ -144,7 +239,7 @@ term.
      words in the configuration because then it is possible to change the
      rules for normalization later without having to adapt the variant rules.
  
-#### Decomposition
+###### Decomposition
  
  In its standard form, only full words match against the source. There
  is a special notation to match the prefix and suffix of a word:
@@ -171,7 +266,7 @@ To avoid automatic decomposition, use the '|' notation:
  
  simply changes "hauptstrasse" to "hauptstr" and "rote strasse" to "rote str".
  
-#### Initial and final terms
+###### Initial and final terms
  
  It is also possible to restrict replacements to the beginning and end of a
  name:
@@ -184,7 +279,7 @@ name:
  So the first example would trigger a replacement for "south 45th street" but
  not for "the south beach restaurant".
  
-#### Replacements vs. variants
+###### Replacements vs. variants
  
  The replacement syntax `source => target` works as a pure replacement. It changes
  the name instead of creating a variant. To create an additional version, you'd
diff --git a/nominatim/tokenizer/sanitizers/split_name_list.py b/nominatim/tokenizer/sanitizers/split_name_list.py

index df2c305bbc2e27cf925594c08be1111650c4cdf6..86385985053ef2623c1c016d458c2d51ae420d44 100644 (file)
--- a/nominatim/tokenizer/sanitizers/split_name_list.py
+++ b/nominatim/tokenizer/sanitizers/split_name_list.py
@@ -1,5 +1,9 @@
  """
-Name processor that splits name values with multiple values into their components.
+Sanitizer that splits lists of names into their components.
+
+Arguments:
+    delimiters: Define the set of characters to be used for
+                splitting the list. (default: `,;`)
  """
  import re
  
@@ -7,9 +11,7 @@ from nominatim.errors import UsageError
  
  def create(func):
      """ Create a name processing function that splits name values with
-        multiple values into their components. The optional parameter
-        'delimiters' can be used to define the characters that should be used
-        for splitting. The default is ',;'.
+        multiple values into their components.
      """
      delimiter_set = set(func.get('delimiters', ',;'))
      if not delimiter_set:
diff --git a/nominatim/tokenizer/sanitizers/strip_brace_terms.py b/nominatim/tokenizer/sanitizers/strip_brace_terms.py

index ec91bac926d2ae3938d9c46f44bb45e1124b0dcd..caadc815edb8e71ec5bca387ee166c10bccd96cb 100644 (file)
--- a/nominatim/tokenizer/sanitizers/strip_brace_terms.py
+++ b/nominatim/tokenizer/sanitizers/strip_brace_terms.py
@@ -1,11 +1,12 @@
  """
-Sanitizer handling names with addendums in braces.
+This sanitizer creates additional name variants for names that have
+addendums in brackets (e.g. "Halle (Saale)"). The additional variant contains
+only the main name part with the bracket part removed.
  """
  
  def create(_):
      """ Create a name processing function that creates additional name variants
-        when a name has an addendum in brackets (e.g. "Halle (Saale)"). The
-        additional variant only contains the main name without the bracket part.
+        for bracket addendums.
      """
      def _process(obj):
          """ Add variants for names that have a bracket extension.
diff --git a/nominatim/tokenizer/sanitizers/tag_analyzer_by_language.py b/nominatim/tokenizer/sanitizers/tag_analyzer_by_language.py

index c0ddd8360275891e967f1511da9f2fdd18c01b0c..739e93136022076f6720aae1fe7bc48bafe88f6b 100644 (file)
--- a/nominatim/tokenizer/sanitizers/tag_analyzer_by_language.py
+++ b/nominatim/tokenizer/sanitizers/tag_analyzer_by_language.py
@@ -1,5 +1,28 @@
  """
-Name processor for tagging the langauge of the name
+This sanitizer sets the `analyzer` property depending on the
+language of the tag. The language is taken from the suffix of the name.
+If a name already has an analyzer tagged, then this is kept.
+
+Arguments:
+
+    filter-kind: Restrict the names the sanitizer should be applied to
+                 to the given tags. The parameter expects a list of
+                 regular expressions which are matched against `kind`.
+                 Note that a match against the full string is expected.
+    whitelist: Restrict the set of languages that should be tagged.
+               Expects a list of acceptable suffixes. When unset,
+               all 2- and 3-letter lower-case codes are accepted.
+    use-defaults:  Configure what happens when the name has no suffix.
+                   When set to 'all', a variant is created for
+                   each of the default languages in the country
+                   the feature is in. When set to 'mono', a variant is
+                   only created when exactly one language is spoken
+                   in the country. The default is to do nothing with
+                   the default languages of a country.
+    mode: Define how the variants are created. May be 'replace' or
+          'append'. When set to 'append' the original name (without
+          any analyzer tagged) is retained. (default: replace)
+
  """
  import re
  
@@ -75,24 +98,6 @@ class _AnalyzerByLanguage:
  
  def create(config):
      """ Create a function that sets the analyzer property depending on the
-        language of the tag. The language is taken from the suffix.
-
-        To restrict the set of languages that should be tagged, use
-        'whitelist'. A list of acceptable suffixes. When unset, all 2- and
-        3-letter codes are accepted.
-
-        'use-defaults' configures what happens when the name has no suffix
-        with a language tag. When set to 'all', a variant is created for
-        each on the spoken languages in the country the feature is in. When
-        set to 'mono', a variant is created, when only one language is spoken
-        in the country. The default is, to do nothing with the default languages
-        of a country.
-
-        'mode' hay be 'replace' (the default) or 'append' and configures if
-        the original name (without any analyzer tagged) is retained.
-
-        With 'filter-kind' the set of names the sanitizer should be applied
-        to can be retricted to the given patterns of 'kind'. It expects a
-        list of regular expression to be matched against 'kind'.
+        language of the tag.
      """
      return _AnalyzerByLanguage(config)
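
To illustrate the `delimiters` argument documented for the split-name-list sanitizer in this commit, here is a minimal standalone sketch. It is not the code from `split_name_list.py`: the function name `split_name_list`, its signature, and the use of `ValueError` (instead of Nominatim's `UsageError`) are assumptions made for illustration. Only the default delimiter set `,;` and the rejection of an empty set are taken from the diff.

```python
import re

def split_name_list(value, delimiters=',;'):
    """Split a multi-name value at any of the delimiter characters.

    Hypothetical helper for illustration only; the real sanitizer works on
    name objects rather than plain strings.
    """
    delimiter_set = set(delimiters)
    if not delimiter_set:
        # The real code raises nominatim.errors.UsageError here.
        raise ValueError("Set of delimiters must not be empty.")

    # Build a character class such as '[,;]' from the escaped delimiters.
    pattern = '[' + ''.join(re.escape(d) for d in sorted(delimiter_set)) + ']'

    # Strip surrounding whitespace and drop empty parts.
    parts = (part.strip() for part in re.split(pattern, value))
    return [part for part in parts if part]


print(split_name_list("Biel/Bienne", delimiters="/"))
print(split_name_list("A; B,C"))
```

With the default delimiters, a name without `,` or `;` comes back as a single-element list, so the sketch leaves ordinary names unchanged.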
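
The strip-brace-terms docstring in this commit describes adding a variant that contains only the main name, with the bracket part removed. A rough sketch of that idea on plain strings follows. The function `strip_brace_terms` and its list-of-strings interface are assumptions for illustration and do not mirror the actual sanitizer, which operates on the tokenizer's internal place objects.

```python
def strip_brace_terms(names):
    """Return the names plus a bracket-free variant for each bracketed name.

    Hypothetical helper: the original name is always kept, and the part
    before the first '(' is appended as an additional variant.
    """
    result = list(names)
    for name in names:
        if '(' in name:
            main = name.split('(')[0].strip()
            # Skip empty results and avoid adding duplicates.
            if main and main not in result:
                result.append(main)
    return result


print(strip_brace_terms(["Halle (Saale)"]))
```

Keeping the original name alongside the shortened one matches the docstring's wording that the sanitizer creates *additional* variants rather than replacing the name.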