ICU transform - OpenSearch Documentation

ICU transform token filter
The icu_transform token filter applies ICU text transformations to tokens, enabling operations such as transliteration, case mapping, normalization, and bidirectional text handling. This filter uses transformation rules defined by the ICU Transform framework.
Common use cases include:
Transliteration: Converting text from one script to another (for example, Cyrillic to Latin)
Script conversion: Transforming between different writing systems
Accent removal: Stripping diacritics from base characters (for example, é to e)
Custom transformations: Applying user-defined transformation rules
Installation
The icu_transform token filter requires the analysis-icu plugin. For installation instructions, see ICU analyzer.
Parameters
The following table lists the parameters for the icu_transform token filter.
| Parameter | Data type | Description |
| :--- | :--- | :--- |
| id | String | The ICU transform ID specifying which transformation to apply. Can be a single transform ID or a compound ID with multiple transforms separated by semicolons. Default is Null (no transformation). |
| dir | String | The direction in which to apply the transform. Valid values are forward (applies the transform named by id) and reverse (applies its inverse). Default is forward. |
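
As a rough illustration of how the two parameters fit together, the following Python sketch builds a filter definition and rejects unknown dir values before anything is sent to a cluster. The helper name icu_transform_filter and the client-side check are assumptions made for this example; they are not part of OpenSearch or any client library.

```python
import json

# The two values the parameter table lists for "dir".
VALID_DIRECTIONS = ("forward", "reverse")


def icu_transform_filter(transform_id, direction="forward"):
    """Return an icu_transform filter definition as a plain dict.

    Hypothetical helper: it only assembles and checks the settings,
    it does not talk to OpenSearch.
    """
    if direction not in VALID_DIRECTIONS:
        raise ValueError(
            f"dir must be one of {VALID_DIRECTIONS}, got {direction!r}"
        )
    return {"type": "icu_transform", "id": transform_id, "dir": direction}


# Print the JSON fragment that would sit under settings.analysis.filter.<name>.
print(json.dumps(icu_transform_filter("Latin-Cyrillic", "reverse"), indent=2))
```

The printed fragment can be pasted under a filter name in an index's analysis settings.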
Transform IDs
You can specify transformations using standard ICU transform IDs. Common transforms include:
Any-Latin: Transliterates text from any script to Latin characters
Latin-Cyrillic: Converts Latin text to Cyrillic
NFD; [:Nonspacing Mark:] Remove; NFC: Decomposes characters, removes diacritics, then recomposes
Lower: Converts text to lowercase
Upper: Converts text to uppercase
Hiragana-Katakana: Converts Hiragana to Katakana
You can chain multiple transforms by separating them with semicolons.
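
To give a feel for what a script-conversion transform does, here is a small Python approximation of Hiragana-Katakana. It relies on the fact that the Hiragana letters U+3041 to U+3096 sit exactly 0x60 code points below their Katakana counterparts in Unicode. This is only a sketch with a hypothetical function name; ICU applies its own rule set, which may treat edge cases differently.

```python
HIRAGANA_FIRST = 0x3041   # ぁ
HIRAGANA_LAST = 0x3096    # ゖ
KATAKANA_OFFSET = 0x60    # distance from a Hiragana letter to its Katakana twin


def hiragana_to_katakana(text):
    """Shift Hiragana letters into the Katakana block; leave everything else alone."""
    return "".join(
        chr(ord(ch) + KATAKANA_OFFSET)
        if HIRAGANA_FIRST <= ord(ch) <= HIRAGANA_LAST
        else ch
        for ch in text
    )


print(hiragana_to_katakana("ひらがな"))  # ヒラガナ
```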
Example: Transliterating to Latin
The following example demonstrates transliteration of multiple scripts to Latin characters:

    PUT /icu-transform-latin
    {
      "settings": {
        "analysis": {
          "filter": {
            "latin_transform": {
              "type": "icu_transform",
              "id": "Any-Latin"
            }
          },
          "analyzer": {
            "latin_analyzer": {
              "tokenizer": "keyword",
              "filter": ["latin_transform"]
            }
          }
        }
      }
    }

Test the analyzer with text in different scripts:

    POST /icu-transform-latin/_analyze
    {
      "analyzer": "latin_analyzer",
      "text": "Москва"
    }

The Cyrillic text is transliterated to Latin:

    {
      "tokens": [
        {
          "token": "Moskva",
          "start_offset": 0,
          "end_offset": 6,
          "type": "word",
          "position": 0
        }
      ]
    }

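Test with Japanese text:

    POST /icu-transform-latin/_analyze
    {
      "analyzer": "latin_analyzer",
      "text": "東京"
    }

Any-Latin transliterates Han characters using their Mandarin pinyin readings, so the result is pinyin rather than the Japanese reading:

    {
      "tokens": [
        {
          "token": "dōng jīng",
          "start_offset": 0,
          "end_offset": 2,
          "type": "word",
          "position": 0
        }
      ]
    }
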
Example: Removing accents
The following example removes diacritical marks from text:

    PUT /icu-transform-no-accents
    {
      "settings": {
        "analysis": {
          "filter": {
            "remove_accents": {
              "type": "icu_transform",
              "id": "NFD; [:Nonspacing Mark:] Remove; NFC"
            }
          },
          "analyzer": {
            "accent_removal_analyzer": {
              "tokenizer": "keyword",
              "filter": ["remove_accents"]
            }
          }
        }
      }
    }

Test the analyzer:

    POST /icu-transform-no-accents/_analyze
    {
      "analyzer": "accent_removal_analyzer",
      "text": "Ênrique Iglesias"
    }

The accents are removed:

    {
      "tokens": [
        {
          "token": "Enrique Iglesias",
          "start_offset": 0,
          "end_offset": 16,
          "type": "word",
          "position": 0
        }
      ]
    }

Example: Script-to-script conversion
The following example converts Latin text to Cyrillic:

    PUT /icu-transform-cyrillic
    {
      "settings": {
        "analysis": {
          "filter": {
            "to_cyrillic": {
              "type": "icu_transform",
              "id": "Latin-Cyrillic"
            }
          },
          "analyzer": {
            "cyrillic_analyzer": {
              "tokenizer": "keyword",
              "filter": ["to_cyrillic"]
            }
          }
        }
      }
    }

Test with Latin text:

    POST /icu-transform-cyrillic/_analyze
    {
      "analyzer": "cyrillic_analyzer",
      "text": "Sankt Peterburg"
    }

The text is converted to Cyrillic script:

    {
      "tokens": [
        {
          "token": "Санкт Петербург",
          "start_offset": 0,
          "end_offset": 15,
          "type": "word",
          "position": 0
        }
      ]
    }

Compound transformations
You can chain multiple transformations by separating transform IDs with semicolons. The transformations are applied in order from left to right.
For example, the compound ID "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC" performs the following steps:
1. Transliterates to Latin
2. Applies canonical decomposition (NFD)
3. Removes non-spacing marks (accents)
4. Applies canonical composition (NFC)
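
The left-to-right evaluation of a compound ID can be sketched in plain Python. The example below splits an ID on semicolons and applies each step in turn, using standard-library stand-ins for a handful of steps. The function apply_compound and the STEP_TABLE mapping are hypothetical names for this illustration; they are not how ICU or OpenSearch implement transforms. Any-Latin has no standard-library equivalent, so it is deliberately left out and raises an error here.

```python
import unicodedata


def _remove_nonspacing_marks(text):
    # [:Nonspacing Mark:] corresponds to Unicode general category "Mn".
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# Stand-ins for a few transform steps; ICU's own versions may differ in detail.
STEP_TABLE = {
    "NFD": lambda text: unicodedata.normalize("NFD", text),
    "NFC": lambda text: unicodedata.normalize("NFC", text),
    "[:Nonspacing Mark:] Remove": _remove_nonspacing_marks,
    "Lower": str.lower,
    "Upper": str.upper,
}


def apply_compound(compound_id, text):
    """Apply each semicolon-separated step to text, from left to right."""
    for step in (part.strip() for part in compound_id.split(";")):
        if not step:
            continue  # tolerate a trailing semicolon
        if step not in STEP_TABLE:
            raise ValueError(f"step not covered by this sketch: {step!r}")
        text = STEP_TABLE[step](text)
    return text


print(apply_compound("NFD; [:Nonspacing Mark:] Remove; NFC", "Ênrique Iglesias"))
# Order matters: the last step wins when two steps undo each other.
print(apply_compound("Upper; Lower", "Moskva"))  # moskva
print(apply_compound("Lower; Upper", "Moskva"))  # MOSKVA
```

Decomposing first is what makes the removal step work: NFD splits a character such as Ê into a base letter and a separate combining mark, and only then can the mark be dropped on its own.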
Related documentation
ICU analyzer
ICU tokenizer
ICU folding token filter
Copyright © OpenSearch Project a Series of LF Projects, LLC