June 2025 — See TJones (WMF)/Notes for other projects. See also T396530. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

decimal_digit is a token filter that converts lots of different kinds of digits from different scripts, mathy variants, etc., to the plain Western Arabic numerals 0–9. For example, all of these are converted to "3": ٣ ۳ ३ ৩ ੩ ౩ ൩ ๓ ໓ ༣ ៣ ᠓ ᥉ 𐒣 ߃ ᧓ ᭓ ႓ ᮳ ᱃ ᱓ ꘣ ꣓ ꤃ ꩓ ᪃ ᪓ ꧓ ꯳ 𑁩 𑃳 𑛃 𑜳 ꧳ 𖭓 ෩ 𑋳 𑓓 𑣣 𖩣 3 𝟑 𝟛 𝟥 𝟹 𑙓. (Arabic, Devanagari, Bengali, Gurmukhi, Telugu, Malayalam, Thai, Lao, Tibetan, Khmer, Mongolian, Limbu, Osmanya, N'Ko, New Tai Lue, Balinese, Myanmar, Sundanese, Lepcha, Ol Chiki, Vai, Saurashtra, Kayah Li, Cham, Tai Tham Hora, Tai Tham Tham, Javanese, Meetei Mayek, Brahmi, Sora Sompeng, Takri, Modi, Ahom, Myanmar Tai Laing, Pahawh Hmong, Sinhala, Khudawadi, Tirhuta, Warang Citi, and Mro, plus CJK fullwidth digits and math variants.)
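For illustration, the same mapping can be sketched in Python with the standard unicodedata module. This mirrors the filter's behavior on strings rather than tokens, and is not the actual Lucene implementation:

```python
import unicodedata

def fold_decimal_digits(text: str) -> str:
    # Replace every Unicode decimal digit (general category Nd) with the
    # ASCII digit of the same numeric value, as decimal_digit does.
    return "".join(
        str(unicodedata.decimal(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in text
    )

print(fold_decimal_digits("٣ ۳ ३ ৩"))  # -> "3 3 3 3"
print(fold_decimal_digits("१९७७"))     # -> "1977"
```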

decimal_digit is already in use for Arabic (and by extension Egyptian Arabic and Moroccan Arabic), Bengali, Hindi, Persian, Sorani, and Thai. There doesn't seem to be any reason why it should not be generally applicable.

However, decimal_digit is redundant when icu_folding is available, since icu_folding already performs the same digit normalization. It can also interact in unexpected ways with various non-standard tokenizers.

In the default monolithic analyzers for Arabic, Bengali, Hindi, Persian, Sorani, and Thai, decimal_digit comes right after lowercase.

Implementation & Analysis


I'm using my ever-growing collection of small samples from many languages, with up to 500 articles for most languages and 500 user queries for many languages. I have samples for 130 languages, plus a small multilingual corpus of text in more than 100 languages that every config can be run against to see how it handles foreign scripts and other "interesting" characters.

I removed the custom decimal_digit config for the languages where it was explicitly configured, and made it a global token filter, conditioned on the absence of icu_folding. As all of the languages that originally had it enabled also have icu_folding now, it was removed from all of them in the production configuration; there were no analysis changes in the samples for those languages.
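As a sketch of that conditional logic (the helper name and chain contents here are illustrative, not the actual CirrusSearch analysis-config-builder code):

```python
def with_global_decimal_digit(filters: list[str]) -> list[str]:
    """Sketch: add decimal_digit to an analyzer's token filter chain
    unless icu_folding is already present (icu_folding subsumes it)."""
    if "icu_folding" in filters or "decimal_digit" in filters:
        return filters
    out = list(filters)
    # Place it right after lowercase, matching the default monolithic
    # analyzers for Arabic, Bengali, Hindi, Persian, Sorani, and Thai.
    idx = out.index("lowercase") + 1 if "lowercase" in out else len(out)
    out.insert(idx, "decimal_digit")
    return out

print(with_global_decimal_digit(["lowercase", "stop", "stemmer"]))
# -> ['lowercase', 'decimal_digit', 'stop', 'stemmer']
print(with_global_decimal_digit(["lowercase", "icu_folding"]))
# -> ['lowercase', 'icu_folding']
```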

Many other languages have icu_folding enabled, so there was no change in their configs (or analysis results). Some languages had decimal_digit added to their configs, but had no non-Arabic digits in their samples. Others had a few digits in foreign scripts that got normalized; these normalized numerals often match other numerals in the sample (e.g., १९७७ matches 1977).

Languages that primarily use scripts covered by decimal_digit had a lot more of the same kind of changes.

A couple of languages had unexpected wrinkles:

  • Chinese splits text that is not Chinese, Latin, or Arabic digits (0–9) into single-character chunks, so Devanagari १९८४ gets tokenized as १, ९, ८, ४, which decimal_digit would then normalize as 1, 9, 8, 4. So, १९८४ would not match 1984 (in Arabic digits), but it would match ৪৮৯১ ("4891" in Bengali digits) or ๑๒๓๔๕๖๗๘๙ ("123456789" in Thai digits), which kind of defeats the purpose of decimal_digit. Fortunately, Chinese has icu_folding enabled for its plain field, so simple numerals can match across scripts. Chinese should not enable decimal_digit.
  • Santali uses the Ol Chiki script, and sometimes mixes punctuation-looking characters with numerals—particularly ᱹ ᱼ and ᱺ which look like Latin/ASCII . - and : . For example: ᱐᱑ᱹ᱑᱑ᱹ᱑᱙᱘᱙, ᱐᱕ᱺ᱓᱐, ᱖᱓᱙ᱼ᱒, and ᱼ᱖᱐ (vs. 01.11.1989, 05:30, 639-2, and -60). These Ol Chiki characters are not numerals, but they collectively stood out in my analysis after converting the surrounding Ol Chiki numerals to Arabic numerals. Regardless of decimal_digit, ᱐᱑ᱹ᱑᱑ᱹ᱑᱙᱘᱙ doesn't match ᱐᱑.᱑᱑.᱑᱙᱘᱙. I had almost 500 numerical examples in my sample of 500 documents, so this is not a vanishingly small issue.
    • Since I'm here looking at this issue now and I don't know when I'll be back, I dug a little deeper. According to the English Wikipedia page (linked above), one of the wiki page's Unicode sources (PDF), and some extra commentary from a useful blog post, it's clear that these modifying marks only occur after certain letters when serving their intended alphabetical purposes: the three vowels ᱚ, ᱟ, & ᱮ can be followed by ᱹ to represent three other similar vowels; the same three vowels can be followed by ᱺ (which combines ᱹ with the nasalization marker ᱸ); and ᱼ is only used after the four ejective consonants ᱜ, ᱡ, ᱫ, and ᱵ.
      • Other uses do appear to be congruent with ASCII punctuation, like in timezones (+᱐᱕ᱺ᱓᱐ == +05:30), in combining proper names (ᱨᱮᱣᱟᱲᱤᱼᱡᱚᱭᱯᱩᱨᱼᱟᱡᱽᱢᱮᱨ == Rewari-Jaipur-Ajmer), in transliterating English acronyms (ᱵᱤᱹᱵᱤᱹᱥᱤ == ᱵᱤ.ᱵᱤ.ᱥᱤ == ᱵᱤ ᱵᱤ ᱥᱤ == "bee bee see" == BBC; all three Santali forms do occur on Santali Wikipedia), or as an ellipsis (ᱹᱹᱹ), etc.
      • My approach is to replace each Santali modifier character with its ASCII look-alike when it does not follow one of the expected 3 or 4 Santali letters, or when it precedes a numeral (Santali, Arabic, or other, using \p{Nd}). That last condition catches some oddball cases like ᱼ᱖᱐ (== "-60") regardless of where they occur (it's unlikely that an article will start with that kind of text, but a query easily could). I've also configured the globally-applied word_break_helper to place itself after these Santali-specific pattern filters (like we already do with an Armenian-specific filter), so that it can do whatever it does to normal periods, colons, and dashes, rather than having the Santali pattern filters try to stay in sync by directly mapping ᱹ and ᱺ to spaces.
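A rough Python sketch of the Santali pattern-filter idea follows. These are not the actual CirrusSearch pattern_replace rules, and since the stdlib re module has no \p{Nd}, ASCII and Ol Chiki digits stand in for "any decimal digit" here:

```python
import re

# ᱹ and ᱺ are real modifier letters only after the vowels ᱚ ᱟ ᱮ;
# ᱼ is a real modifier only after the ejectives ᱜ ᱡ ᱫ ᱵ.
# Elsewhere (or before a digit) treat them as . : - look-alikes.
DIGIT = "0-9᱐-᱙"  # simplified stand-in for \p{Nd}

def normalize_santali_marks(text: str) -> str:
    text = re.sub(f"(?<![ᱚᱟᱮ])ᱹ|ᱹ(?=[{DIGIT}])", ".", text)
    text = re.sub(f"(?<![ᱚᱟᱮ])ᱺ|ᱺ(?=[{DIGIT}])", ":", text)
    text = re.sub(f"(?<![ᱜᱡᱫᱵ])ᱼ|ᱼ(?=[{DIGIT}])", "-", text)
    return text

print(normalize_santali_marks("᱐᱕ᱺ᱓᱐"))  # -> ᱐᱕:᱓᱐  ("05:30")
print(normalize_santali_marks("ᱼ᱖᱐"))    # -> -᱖᱐   ("-60")
print(normalize_santali_marks("ᱚᱹ"))      # unchanged: legitimate vowel use
```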