# Writing custom sanitizer modules

Sanitizers preprocess name and address information from the OpenStreetMap
input data before it goes into the search index. Read more about
them in the [Customizing sanitizers](../customize/Sanitizers.md) section.

This section explains how to write your own sanitizer step function.

## Using custom sanitizer modules

To use a custom sanitizer step, refer to the sanitizer module
in the `step` property. There are two ways
to include external modules: through a library or from the project directory.

To include a module from a library, use the absolute import path as the name and
make sure the library can be found in your `PYTHONPATH`.

For example, say you have your sanitizer steps in a Python package
`my_sanitizer` and want to refer to the step implemented in module
`translate_street.py`:

    - step: my_sanitizer.translate_street
      config: some other config info for the step

To use a custom module without creating a library, you can put the module
somewhere in your project directory and then use the relative path to the
file. Use the full name of the file, including the `.py` ending.

For example, say you have put your module `translate_street.py` directly into
the project directory:

    - step: translate_street.py
      config: some other config info for the step

## Basic sanitizer module setup

A sanitizer module must export a single factory function `create` with the
following signature:

    def create(config: SanitizerConfig) -> Callable[[ProcessInfo], None]

The function receives the custom configuration for the sanitizer and must
return a callable (a function or an object with a `__call__` method) that
transforms the name and address terms of a place. When a place is processed,
a `ProcessInfo` object is created from the information that was queried from
the database. This object is handed to each configured sanitizer in sequence,
so that each sanitizer receives the result of processing from the previous
sanitizer. After the last sanitizer has finished, the resulting name and
address lists are forwarded to the token analysis module.

Sanitizer functions are instantiated once and then called for each place
that is imported or updated. They don't need to be thread-safe.
If multi-threading is used, each thread creates its own instance of
the function.

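
The contract described above can be sketched with a small class whose instances are callable. This is only an illustration: the class name `_StripWhitespace` and the `enabled` option are hypothetical, and the sketch assumes that the configuration object supports dictionary-style `get()`. Type annotations are left out so that the code runs without importing Nominatim.

```python
class _StripWhitespace:
    """Callable sanitizer that trims surrounding whitespace from names.

    Hypothetical example class, not part of Nominatim.
    """

    def __init__(self, config):
        # Assumption: the configuration offers dict-style access.
        # 'enabled' is a made-up option for this sketch.
        self.enabled = config.get('enabled', True)

    def __call__(self, obj):
        if not self.enabled:
            return
        # Entries are modified in place; nothing is returned.
        for name in obj.names:
            name.name = name.name.strip()
        for name in obj.address:
            name.name = name.name.strip()


def create(config):
    """Factory function that every sanitizer module has to export."""
    return _StripWhitespace(config)
```

Because `create()` is called once per instance, any setup work that depends on the configuration can be done in `__init__` rather than for every place.
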
### Sanitizer configuration

::: nominatim_db.tokenizer.sanitizers.config.SanitizerConfig

### The main filter function of the sanitizer

The filter function receives a single object of type `ProcessInfo`,
which has three members:

* `place: PlaceInfo`: read-only information about the place being processed.
* `names: List[PlaceName]`: the current list of names for the place.
* `address: List[PlaceName]`: the current list of address names for the place.

While the `place` member is provided for information only, the `names` and
`address` lists are meant to be manipulated by the sanitizer. It may add and
remove entries, change information within a single entry (for example by
adding extra attributes) or completely replace the list with a different one.

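
The kinds of manipulation listed above might look roughly as follows. To keep the sketch runnable on its own, it uses two simplified stand-in classes (`StandInName` and `StandInInfo`) in place of the real `PlaceName` and `ProcessInfo`. These stand-ins, the `outdated` attribute and the kinds `old_name` and `note` are assumptions chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class StandInName:
    """Simplified stand-in for PlaceName (illustration only)."""
    name: str
    kind: str
    suffix: Optional[str] = None
    attr: Dict[str, str] = field(default_factory=dict)


@dataclass
class StandInInfo:
    """Simplified stand-in for ProcessInfo without the place member."""
    names: List[StandInName]
    address: List[StandInName]


def _filter_function(obj):
    # Replace the whole list: keep only entries with a non-empty name.
    obj.names = [n for n in obj.names if n.name.strip()]

    # Change single entries: mark old names with an extra attribute.
    # The attribute name 'outdated' is invented for this sketch.
    for n in obj.names:
        if n.kind == 'old_name':
            n.attr['outdated'] = 'yes'

    # Remove entries in place: drop address parts of kind 'note'.
    obj.address[:] = [a for a in obj.address if a.kind != 'note']


def create(config):
    return _filter_function
```

In a real module the stand-in classes are not needed, because the filter function is handed the actual objects.
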
#### PlaceInfo - information about the place

::: nominatim_db.data.place_info.PlaceInfo

#### PlaceName - extended naming information

::: nominatim_db.data.place_name.PlaceName

### Example: Filter for US street prefixes

The following sanitizer removes the directional prefixes from street names
in the US:

    import re

    def _filter_function(obj):
        # Only handle streets (address rank 26 or 27) in the US.
        if obj.place.country_code == 'us' \
           and obj.place.rank_address >= 26 and obj.place.rank_address <= 27:
            for name in obj.names:
                name.name = re.sub(r'^(north|south|west|east) ',
                                   '', name.name, flags=re.IGNORECASE)

    def create(config):
        return _filter_function

This is the simplest form of a sanitizer module. It defines a single
filter function and implements the required `create()` function by returning
the filter function.

The filter function first checks if the object is interesting for the
sanitizer. Namely, it checks if the place is in the US (through `country_code`)
and if the place is a street (a `rank_address` of 26 or 27). If the
conditions are met, it goes through all available names and
removes any leading directional prefix using a simple regular expression.

Save the source code in a file in your project directory, for example as
`us_streets.py`. Then you can use the sanitizer in your `icu_tokenizer.yaml`:

    - step: us_streets.py

This example is just a simplified showcase of how to create a sanitizer.
It is not meant for real-world use: while the sanitizer would
correctly transform `West 5th Street` into `5th Street`, it would also
shorten a simple `North Street` to `Street`.

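
One way to soften this particular problem is to remove the prefix only when at least two more words follow it. The sketch below does this with a lookahead in the regular expression. The heuristic is an assumption made for illustration and is not taken from Nominatim. It still shortens names in which the direction is part of the proper name.

```python
import re

# Match a leading direction only if at least two more words follow,
# so that 'North Street' is left alone.
_PREFIX = re.compile(r'^(north|south|west|east) (?=\S+ +\S)', re.IGNORECASE)


def _filter_function(obj):
    # Only handle streets (address rank 26 or 27) in the US.
    if obj.place.country_code == 'us' and 26 <= obj.place.rank_address <= 27:
        for name in obj.names:
            name.name = _PREFIX.sub('', name.name)


def create(config):
    return _filter_function
```
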

For more sanitizer examples, have a look at the sanitizers provided by Nominatim.
They can be found in the directory
[`src/nominatim_db/tokenizer/sanitizers`](https://github.com/osm-search/Nominatim/tree/master/src/nominatim_db/tokenizer/sanitizers).