July 2016. See TJones_(WMF)/Notes for other projects. (Phabricator ticket: T138315)
TextCat Optimization for ptwiki, ruwiki, and jawiki
Summary of Results
Using the default 3K models, the best options for each wiki are presented below:
ptwiki
languages: Portuguese, English, Russian, Hebrew, Arabic, Chinese, Korean, Greek
lang codes: pt, en, ru, he, ar, zh, ko, el
relevant poor-performing queries: 46%
f0.5: 96.9%
ruwiki
languages: Russian, English, Ukrainian, Georgian, Armenian, Japanese, Arabic, Hebrew, Chinese
lang codes: ru, en, uk, ka, hy, ja, ar, he, zh
relevant poor-performing queries: 30.5%
f0.5: 92.4%
jawiki
languages: Japanese, English, Russian, Korean, Arabic, Hebrew
lang codes: ja, en, ru, ko, ar, he
relevant poor-performing queries: 50%
f0.5: 95.1%
Background
See the earlier report on frwiki, eswiki, itwiki, and dewiki for information on how the corpora were created.
Portuguese Results
About 12% of the original 10K corpus was removed in the initial filtering. A 1000-query random sample was taken, and 48% of those queries were discarded, leaving a 524-query corpus. Thus only about 46% of low-performing queries are in an identifiable language.
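The "identifiable language" estimate is the product of two fractions: the share of the corpus that survived the initial filtering and the share of the random sample that was kept. A minimal sketch of that arithmetic, assuming the "10K" corpora held exactly 10,000 queries each (the function name and that exact figure are assumptions; the other counts come from this page):

```python
def identifiable_share(original, after_filtering, sample_size, kept_in_sample):
    """Estimate the fraction of the original corpus that is in an identifiable language."""
    survived_filter = after_filtering / original    # share left after initial filtering
    kept_fraction = kept_in_sample / sample_size    # share of the sample not discarded
    return survived_filter * kept_fraction

# (wiki, queries left after filtering, sample size, queries kept from the sample)
CORPORA = [
    ("ptwiki", 8_797, 1_000, 524),
    ("ruwiki", 8_931, 1_500, 512),
    ("jawiki", 9_292, 1_000, 534),
]

for wiki, remaining, sample, kept in CORPORA:
    share = identifiable_share(10_000, remaining, sample, kept)
    print(f"{wiki}: {share:.1%}")
# ptwiki: 46.1%
# ruwiki: 30.5%
# jawiki: 49.6%
```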
Other languages searched on ptwiki
Based on the sample of 524 poor-performing queries on ptwiki that are in some language, about 94% are in Portuguese, about 5% in English, and fewer than 1% each are in a handful of other languages.
Below are the results for ptwiki, with raw counts, percentage, and 95% margin of error.
count lg % +/-
490 pt 93.51% 2.11%
25 en 4.77% 1.83%
4 es 0.76% 0.75%
1 tl 0.19% 0.37%
1 ru 0.19% 0.37%
1 nl 0.19% 0.37%
1 la 0.19% 0.37%
1 fr 0.19% 0.37%
In order, those are Portuguese, English, Spanish, Tagalog, Russian, Dutch, Latin, and French.
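The +/- figures in the table above appear consistent with the usual normal-approximation margin of error for a proportion, 1.96 * sqrt(p * (1 - p) / n). The sketch below assumes that formula (the page does not say how the margins were computed) and takes the Spanish and Tagalog counts from the "total" column of the detailed 3K report:

```python
import math

def margin_of_error(count, n, z=1.96):
    """95% margin of error for a proportion, using the normal approximation."""
    p = count / n
    return z * math.sqrt(p * (1 - p) / n)

SAMPLE_SIZE = 524  # ptwiki queries in an identifiable language

for lang, count in [("pt", 490), ("en", 25), ("es", 4), ("tl", 1)]:
    share = count / SAMPLE_SIZE
    print(f"{count} {lang} {share:.2%} {margin_of_error(count, SAMPLE_SIZE):.2%}")
```

For languages with only one or two queries the normal approximation is rough, which is one more reason to treat those rows as indicative only.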
We don’t have query-trained language models for two of the languages represented here, Tagalog and Latin. Since each represents a very small slice of our corpus (1 query each), we aren’t going to worry about them, and accept that they will not be detected correctly.
Looking at the larger corpus of 8797 remaining queries after the initial filtering, focusing on queries in other writing systems, there are also a small number of Hebrew, Arabic, Chinese, Korean, and Greek queries, and Burmese (for which we do not have models).
Analysis and Optimization
Using all of the language models available, I ran tests on various model sizes, in increments of 500 up to 5,000 and increments of 1,000 up to 10,000. Results for the 3K models, and some of the models that did better are here:
model size 3000 5000 6000 9000 10000
TOTAL 86.8% 87.4% 88.0% 88.2% 88.7%
Portuguese 93.2% 93.6% 93.9% 94.2% 94.5%
English 78.4% 80.0% 81.6% 76.6% 76.6%
Spanish 13.1% 13.6% 13.8% 14.3% 15.1%
Dutch 28.6% 25.0% 25.0% 28.6% 33.3%
French 28.6% 33.3% 40.0% 33.3% 28.6%
Latin 0.0% 0.0% 0.0% 0.0% 0.0%
Russian 100.0% 100.0% 100.0% 100.0% 100.0%
Tagalog 0.0% 0.0% 0.0% 0.0% 0.0%
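For readers unfamiliar with what "model size" and "language set" control, here is a toy sketch of rank-ordered n-gram classification in the style of Cavnar and Trenkle, the general approach TextCat follows. It is not TextCat's actual code: the training sentences, function names, and penalty rule are illustrative assumptions. Each model keeps only its `model_size` most frequent n-grams, and detection only considers the models that are passed in.

```python
from collections import Counter

def ngrams(text, max_n=5):
    """Yield character n-grams (n = 1..max_n) of each word, padded with '_'."""
    for word in text.lower().split():
        padded = f"_{word}_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]

def build_model(text, model_size):
    """Map the `model_size` most frequent n-grams to their rank (0 = most frequent)."""
    counts = Counter(ngrams(text))
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:model_size]
    return {gram: rank for rank, gram in enumerate(ranked)}

def distance(query, model, model_size):
    """Out-of-place distance; n-grams missing from the model get the maximum penalty."""
    query_profile = build_model(query, model_size)
    return sum(abs(rank - model[gram]) if gram in model else model_size
               for gram, rank in query_profile.items())

def detect(query, models, model_size):
    """Return the language whose model is closest to the query (lowest distance)."""
    scores = {lang: distance(query, m, model_size) for lang, m in models.items()}
    return min(sorted(scores), key=scores.get)

# Tiny hypothetical training texts; real models are trained on large query sets.
TRAINING = {
    "pt": "o rato roeu a roupa do rei de roma e a rainha com raiva resolveu remendar",
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
}
MODEL_SIZE = 3000
MODELS = {lang: build_model(text, MODEL_SIZE) for lang, text in TRAINING.items()}

print(detect("a roupa do rei", MODELS, MODEL_SIZE))
print(detect("the lazy cat", MODELS, MODEL_SIZE))
```

Dropping a language from `MODELS` removes it as a possible answer, which is how a poorly performing model can be stopped from producing false positives.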
Performance details for the 3K model are here (details for larger models are similar in terms of which language models perform the most poorly):
f0.5 f1 f2 recall prec total hits misses
TOTAL 86.8% 86.8% 86.8% 86.8% 86.8% 524 455 69
Portuguese 97.2% 93.2% 89.6% 87.3% 100.0% 490 428 0
English 77.5% 78.4% 79.4% 80.0% 76.9% 25 20 6
Spanish 8.6% 13.1% 27.4% 100.0% 7.0% 4 4 53
Dutch 20.0% 28.6% 50.0% 100.0% 16.7% 1 1 5
French 20.0% 28.6% 50.0% 100.0% 16.7% 1 1 5
Latin 0.0% 0.0% 0.0% 0.0% 0.0% 1 0 0
Russian 100.0% 100.0% 100.0% 100.0% 100.0% 1 1 0
Tagalog 0.0% 0.0% 0.0% 0.0% 0.0% 1 0 0
Spanish does very poorly, with way too many false positives. Dutch and French aren’t terrible in terms of raw false positives, but aren’t great, either.
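The f0.5, f1, and f2 columns in the report above can be reproduced from the count columns with the standard F-beta formula. This sketch assumes that "hits" are correct identifications, "misses" are false positives for that language's model, and "total" is the number of queries actually in the language; that reading is inferred from the reported precision and recall, not stated on the page.

```python
def f_beta(hits, misses, total, beta):
    """F-beta score from report counts (misses = false positives, by assumption)."""
    precision = hits / (hits + misses) if hits + misses else 0.0
    recall = hits / total if total else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Rows from the ptwiki 3K report: (language, total, hits, misses)
for lang, total, hits, misses in [("English", 25, 20, 6), ("Spanish", 4, 4, 53)]:
    scores = [f"{f_beta(hits, misses, total, beta):.1%}" for beta in (0.5, 1, 2)]
    print(lang, *scores)
# English 77.5% 78.4% 79.4%
# Spanish 8.6% 13.1% 27.4%
```

f0.5 weights precision more heavily than recall, so a model like Spanish here, with perfect recall but many false positives, scores worst on f0.5.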
As noted above, Hebrew, Arabic, Chinese, Korean, and Greek are present in the larger sample, and as our models for these languages are very high accuracy, I’ve included them.
The final language set is Portuguese, English, Russian, Hebrew, Arabic, Chinese, Korean, and Greek. With these languages, 3K is the optimal model size. The 3K results are shown below along with other top-performing model sizes:
model size 2500 3000 9000 10000
TOTAL 96.9% 96.9% 96.9% 96.9%
Portuguese 98.9% 98.8% 98.7% 98.7%
English 79.4% 80.6% 82.0% 81.4%
Spanish 0.0% 0.0% 0.0% 0.0%
Dutch 0.0% 0.0% 0.0% 0.0%
French 0.0% 0.0% 0.0% 0.0%
Latin 0.0% 0.0% 0.0% 0.0%
Russian 100.0% 100.0% 100.0% 100.0%
Tagalog 0.0% 0.0% 0.0% 0.0%
The detailed report for the 3K model is here:
f0.5 f1 f2 recall prec total hits misses
TOTAL 96.9% 96.9% 96.9% 96.9% 96.9% 524 508 16
Portuguese 99.0% 98.8% 98.5% 98.4% 99.2% 490 482 4
English 72.3% 80.6% 91.2% 100.0% 67.6% 25 25 12
Spanish 0.0% 0.0% 0.0% 0.0% 0.0% 4 0 0
Dutch 0.0% 0.0% 0.0% 0.0% 0.0% 1 0 0
French 0.0% 0.0% 0.0% 0.0% 0.0% 1 0 0
Latin 0.0% 0.0% 0.0% 0.0% 0.0% 1 0 0
Russian 100.0% 100.0% 100.0% 100.0% 100.0% 1 1 0
Tagalog 0.0% 0.0% 0.0% 0.0% 0.0% 1 0 0
Recall went up and precision went down for Portuguese and English, but overall performance improved. Queries in unrepresented languages were all identified as English, except for Spanish queries, which were identified as Portuguese (decreasing precision for both), but those now unused models are no longer generating lots of false positives and bringing down precision overall.
ptwiki: Best Options
The optimal settings for ptwiki, based on these experiments, would be to use models for Portuguese, English, Russian, Hebrew, Arabic, Chinese, Korean, Greek (pt, en, ru, he, ar, zh, ko, el), using the default 3000-ngram models.
Russian Results
About 10.7% of the original 10K corpus was removed in the initial filtering. A 1500-query random sample was taken, and 65.8% of those queries were discarded, leaving a 512-query corpus. Thus only about 30.5% of low-performing queries are in an identifiable language.
Other languages searched on ruwiki
Based on the sample of 512 poor-performing queries on ruwiki that are in some language, about 77% are in Russian, >10% in English, <5% in Ukrainian, and fewer than 1% each are in a handful of other languages.
Below are the results for ruwiki, with raw counts, percentage, and 95% margin of error.
count lg % +/-
394 ru 76.95% 3.65%
67 en 13.09% 2.92%
25 uk 4.88% 1.87%
4 kk 0.78% 0.76%
4 de 0.78% 0.76%
3 ka 0.59% 0.66%
2 uz 0.39% 0.54%
2 ky 0.39% 0.54%
2 hy 0.39% 0.54%
1 ro 0.20% 0.38%
1 lv 0.20% 0.38%
1 ja 0.20% 0.38%
1 it 0.20% 0.38%
1 fr 0.20% 0.38%
1 fi 0.20% 0.38%
1 es 0.20% 0.38%
1 az 0.20% 0.38%
1 ar 0.20% 0.38%
In order, those are Russian, English, Ukrainian, Kazakh, German, Georgian, Uzbek, Kirghiz, Armenian, Romanian, Latvian, Japanese, Italian, French, Finnish, Spanish, Azerbaijani, and Arabic.
We don’t have query-trained language models for all of the languages represented here, such as Azerbaijani, Finnish, Kazakh, Kirghiz, Latvian, Romanian, and Uzbek. Since these each represent very small slices of our corpus (< 5 queries each), we aren’t going to worry about them, and accept that they will not be detected correctly.
Looking at the larger corpus of 8,931 remaining queries after the initial filtering, focusing on queries in other writing systems, there are also a small number of Hebrew and Chinese queries.
Analysis and Optimization
Using all of the language models available, I ran tests on various model sizes, in increments of 500 up to 5,000 and increments of 1,000 up to 10,000. Results for the 3K models, and some of the models that did better are here:
model size 3000 4500 5000 7000
TOTAL 88.5% 89.5% 90.0% 91.2%
Russian 96.7% 96.8% 97.2% 97.6%
English 76.4% 80.0% 78.9% 82.1%
Ukrainian 67.7% 68.9% 72.4% 78.0%
German 40.0% 53.3% 50.0% 53.3%
Kazakh 0.0% 0.0% 0.0% 0.0%
Georgian 100.0% 100.0% 100.0% 100.0%
Armenian 100.0% 100.0% 100.0% 100.0%
Kirghiz 0.0% 0.0% 0.0% 0.0%
Uzbek 0.0% 0.0% 0.0% 0.0%
Arabic 100.0% 100.0% 100.0% 100.0%
Azerbaijani 0.0% 0.0% 0.0% 0.0%
Finnish 0.0% 0.0% 0.0% 0.0%
French 0.0% 0.0% 33.3% 40.0%
Italian 20.0% 18.2% 20.0% 22.2%
Japanese 100.0% 100.0% 100.0% 100.0%
Latvian 0.0% 0.0% 0.0% 0.0%
Romanian 0.0% 0.0% 0.0% 0.0%
Spanish 0.0% 0.0% 0.0% 0.0%
Performance details for the 3K model are here (details for larger models are similar in terms of which language models perform the most poorly):
f0.5 f1 f2 recall prec total hits misses
TOTAL 88.5% 88.5% 88.5% 88.5% 88.5% 512 453 59
Russian 97.1% 96.7% 96.2% 95.9% 97.4% 394 378 10
English 87.9% 76.4% 67.5% 62.7% 97.7% 67 42 1
Ukrainian 60.7% 67.7% 76.6% 84.0% 56.8% 25 21 16
German 29.4% 40.0% 62.5% 100.0% 25.0% 4 4 12
Kazakh 0.0% 0.0% 0.0% 0.0% 0.0% 4 0 0
Georgian 100.0% 100.0% 100.0% 100.0% 100.0% 3 3 0
Armenian 100.0% 100.0% 100.0% 100.0% 100.0% 2 2 0
Kirghiz 0.0% 0.0% 0.0% 0.0% 0.0% 2 0 0
Uzbek 0.0% 0.0% 0.0% 0.0% 0.0% 2 0 0
Arabic 100.0% 100.0% 100.0% 100.0% 100.0% 1 1 0
Azerbaijani 0.0% 0.0% 0.0% 0.0% 0.0% 1 0 0
Finnish 0.0% 0.0% 0.0% 0.0% 0.0% 1 0 0
French 0.0% 0.0% 0.0% 0.0% 0.0% 1 0 5
Italian 13.5% 20.0% 38.5% 100.0% 11.1% 1 1 8
Japanese 100.0% 100.0% 100.0% 100.0% 100.0% 1 1 0
Latvian 0.0% 0.0% 0.0% 0.0% 0.0% 1 0 0
Romanian 0.0% 0.0% 0.0% 0.0% 0.0% 1 0 0
Spanish 0.0% 0.0% 0.0% 0.0% 0.0% 1 0 7
French, Spanish, Italian, and German all do very poorly, with too many false positives.
As noted above, Hebrew and Chinese are present in the larger sample, and as our models for these languages are very high accuracy, I’ve included them.
The final language set is Russian, English, Ukrainian, Georgian, Armenian, Japanese, Arabic, Hebrew, and Chinese. As with the full set of models, 3K is not the optimal model size, but its total is within 1.5 percentage points of the best size tested. The 3K results are shown below along with the best-performing model sizes:
model size 3000 4500 5000 7000
TOTAL 92.4% 92.6% 93.2% 93.8%
Russian 96.7% 96.8% 97.2% 97.6%
English 91.2% 91.2% 91.2% 91.2%
Ukrainian 67.7% 68.9% 72.4% 78.0%
German 0.0% 0.0% 0.0% 0.0%
Kazakh 0.0% 0.0% 0.0% 0.0%
Georgian 100.0% 100.0% 100.0% 100.0%
Armenian 100.0% 100.0% 100.0% 100.0%
Kirghiz 0.0% 0.0% 0.0% 0.0%
Uzbek 0.0% 0.0% 0.0% 0.0%
Arabic 100.0% 100.0% 100.0% 100.0%
Azerbaijani 0.0% 0.0% 0.0% 0.0%
Finnish 0.0% 0.0% 0.0% 0.0%
French 0.0% 0.0% 0.0% 0.0%
Italian 0.0% 0.0% 0.0% 0.0%
Japanese 100.0% 100.0% 100.0% 100.0%
Latvian 0.0% 0.0% 0.0% 0.0%
Romanian 0.0% 0.0% 0.0% 0.0%
Spanish 0.0% 0.0% 0.0% 0.0%
The accuracy is very high, and the differences are reasonably small, so it makes sense to stick with the default 3K models for now, but keep an eye out for significant performance improvements with other model sizes.
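The reasoning above amounts to a simple rule: keep the default model size unless another size beats it by more than some margin. A minimal sketch of that rule follows; the function name and the 1.5-point tolerance as a hard threshold are assumptions for illustration, and the totals are the ruwiki figures from the table above.

```python
def choose_model_size(totals, default=3000, tolerance=1.5):
    """Keep the default size unless the best size wins by more than `tolerance` points.

    Returns (chosen size, best size, gap in percentage points).
    """
    best = max(totals, key=totals.get)
    gap = totals[best] - totals[default]
    chosen = default if gap <= tolerance else best
    return chosen, best, round(gap, 1)

# TOTAL row for the final ruwiki language set, in percent
ruwiki_totals = {3000: 92.4, 4500: 92.6, 5000: 93.2, 7000: 93.8}
print(choose_model_size(ruwiki_totals))  # (3000, 7000, 1.4)
```

With a stricter tolerance, such as 1.0, the same data would select the 7000-ngram models instead.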
The detailed report for the 3K model is here:
f0.5 f1 f2 recall prec total hits misses
TOTAL 92.4% 92.4% 92.4% 92.4% 92.4% 512 473 39
Russian 97.1% 96.7% 96.2% 95.9% 97.4% 394 378 10
English 86.6% 91.2% 96.3% 100.0% 83.8% 67 67 13
Ukrainian 60.7% 67.7% 76.6% 84.0% 56.8% 25 21 16
German 0.0% 0.0% 0.0% 0.0% 0.0% 4 0 0
Kazakh 0.0% 0.0% 0.0% 0.0% 0.0% 4 0 0
Georgian 100.0% 100.0% 100.0% 100.0% 100.0% 3 3 0
Armenian 100.0% 100.0% 100.0% 100.0% 100.0% 2 2 0
Kirghiz 0.0% 0.0% 0.0% 0.0% 0.0% 2 0 0
Uzbek 0.0% 0.0% 0.0% 0.0% 0.0% 2 0 0
Arabic 100.0% 100.0% 100.0% 100.0% 100.0% 1 1 0
Azerbaijani 0.0% 0.0% 0.0% 0.0% 0.0% 1 0 0
Finnish 0.0% 0.0% 0.0% 0.0% 0.0% 1 0 0
French 0.0% 0.0% 0.0% 0.0% 0.0% 1 0 0
Italian 0.0% 0.0% 0.0% 0.0% 0.0% 1 0 0
Japanese 100.0% 100.0% 100.0% 100.0% 100.0% 1 1 0
Latvian 0.0% 0.0% 0.0% 0.0% 0.0% 1 0 0
Romanian 0.0% 0.0% 0.0% 0.0% 0.0% 1 0 0
Spanish 0.0% 0.0% 0.0% 0.0% 0.0% 1 0 0
Recall went way up and precision went down for English, but overall performance improved. Queries in unrepresented languages were all identified as English (decreasing precision), but those now unused models are no longer generating lots of false positives and bringing down precision overall.
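The TOTAL row shows the same value in every column because, if every query receives exactly one label, each wrong answer counts once as a false positive (for the language chosen) and once as a false negative (for the true language). Overall precision and recall therefore coincide. A small sketch using the ruwiki 3K report above, under that one-label-per-query assumption (the helper name is hypothetical):

```python
def micro_totals(rows, unmodeled_queries=0):
    """Sum per-language (total, hits, misses) rows into overall recall and precision."""
    total = sum(t for t, _, _ in rows.values()) + unmodeled_queries
    hits = sum(h for _, h, _ in rows.values())
    misses = sum(m for _, _, m in rows.values())
    recall = hits / total
    precision = hits / (hits + misses)
    return total, hits, misses, recall, precision

# (total, hits, misses) for the languages with an enabled model and at least one query
RUWIKI_3K = {
    "Russian": (394, 378, 10),
    "English": (67, 67, 13),
    "Ukrainian": (25, 21, 16),
    "Georgian": (3, 3, 0),
    "Armenian": (2, 2, 0),
    "Arabic": (1, 1, 0),
    "Japanese": (1, 1, 0),
}
# German, Kazakh, Kirghiz, Uzbek, Azerbaijani, Finnish, French, Italian,
# Latvian, Romanian, and Spanish queries: 19 in all, with no enabled model.
total, hits, misses, recall, precision = micro_totals(RUWIKI_3K, unmodeled_queries=19)
print(total, hits, misses, f"{recall:.1%}", f"{precision:.1%}")
# 512 473 39 92.4% 92.4%
```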
ruwiki: Best Options
The slightly sub-optimal settings (though consistent with others using 3K models) for ruwiki, based on these experiments, would be to use models for Russian, English, Ukrainian, Georgian, Armenian, Japanese, Arabic, Hebrew, Chinese (ru, en, uk, ka, hy, ja, ar, he, zh), using the default 3000-ngram models.
Notes on Latin Russian, Cyrillic English, etc.
Since I recently did some work on typing on the wrong keyboard in Russian and English, I enabled the models for Latin Russian and Cyrillic English for the first 1000 random samples I looked at. I did not include the additional filters mentioned in my previous write-up, since I only use the models at that stage to roughly group queries for manual review.
Of the 21 (2.1%) identified as Cyrillic English (i.e., English typed on a Russian or other Cyrillic keyboard),
6 were Cyrillic English (including 2 very short acronyms)
1 was mixed (Cyrillic/Latin), but it converted to something plausible
8 were Russian/Cyrillic (including names, acronyms, typos)
3 more were very short (2-3 letters)
3 were junk
Of the 16 (1.6%) identified as Latin Russian (i.e., Russian typed on an American English or other Latin keyboard),
13 were Latin Russian/Cyrillic (including names)
1 was a name in Cyrillic
2 were apparent junk (1 of which was also mixed Cyrillic/Latin)
In passing, while working on the queries, I also noticed:
several Russian queries transliterated into Latin, sometimes identified as Polish, sometimes mixed with English
a few Latin queries (including names) transliterated into Russian
at least one each of Georgian and Armenian transliterated into Latin
a couple of cases of Devanagari transliterated into Cyrillic
It looks like there is a decent-sized chunk of queries that could be improved by identifying wrong-keyboard and transliterated queries and converting them to the intended script. Queries typed on phonetic keyboards and transliterated queries will be harder to identify, since they at least look like language even in the wrong character set (i.e., there are enough vowels in reasonable places).
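To make the "Latin Russian" and "Cyrillic English" cases concrete, here is a sketch of key-for-key conversion between the standard US QWERTY and Russian ЙЦУКЕН layouts. It covers lowercase letter keys only and is an illustration, not the conversion used in the earlier wrong-keyboard work; the function names are hypothetical.

```python
# Keys in the same physical positions on the two layouts, row by row.
LATIN = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
CYRILLIC = "йцукенгшщзхъфывапролджэячсмитьбю"

LATIN_TO_CYRILLIC = str.maketrans(LATIN, CYRILLIC)
CYRILLIC_TO_LATIN = str.maketrans(CYRILLIC, LATIN)

def latin_to_cyrillic(text):
    """Recover Russian typed with a Latin (US) keyboard layout active."""
    return text.lower().translate(LATIN_TO_CYRILLIC)

def cyrillic_to_latin(text):
    """Recover English typed with a Russian keyboard layout active."""
    return text.lower().translate(CYRILLIC_TO_LATIN)

print(latin_to_cyrillic("ghbdtn"))  # привет
print(cyrillic_to_latin("цшлш"))    # wiki
```

Very short strings convert to something plausible in either direction, which fits the observation above that many of the misidentified queries were 2-3 letters long.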
Japanese Results
About 7% of the original 10K corpus was removed in the initial filtering. A 1000-query random sample was taken, and 47% of those queries were discarded, leaving a 534-query corpus. Thus only about 50% of low-performing queries are in an identifiable language.
Notes on language identification
It’s not uncommon to see Japanese names (of people, places, anime, manga) transliterated into Latin characters in all of the query corpora that I’ve looked at, but there are a lot more in jawiki. For example, the manga “One Piece” is “ワンピース” in Japanese (“One Piece” transliterated into Japanese), while “Wan Pisu” (the Japanese name transliterated back into the Latin alphabet) is somewhere in between English and Japanese. For the purposes of this analysis, these were discarded as “names” (though something like “one piece”, while being a name, is made up of English words).
Similar transliteration for Indian movies is also common, though not in jawiki.
Other languages searched on jawiki
Based on the sample of 534 poor-performing queries on jawiki that are in some language, almost 90% are in Japanese, about 6% are in English, 4% in Chinese, and fewer than 1% each are in a handful of other languages.
Below are the results for jawiki, with raw counts, percentage, and 95% margin of error.
count lg % +/-
474 ja 88.76% 2.68%
33 en 6.18% 2.04%
23 zh 4.31% 1.72%
1 ru 0.19% 0.37%
1 ko 0.19% 0.37%
1 kk 0.19% 0.37%
1 de 0.19% 0.37%
In order, those are Japanese, English, Chinese, Russian, Korean, Kazakh, and German.
We don’t have query-trained language models for all of the languages represented here, in particular Kazakh. Since it represents a tiny slice of our corpus (1 query), we aren’t going to worry about it, and accept that it will not be detected correctly.
Looking at the larger corpus of 9292 remaining queries after the initial filtering, focusing on queries in other writing systems, there are also a small number of Arabic and Hebrew queries.
Analysis and Optimization
Using all of the language models available, I ran tests on various model sizes, in increments of 500 up to 5,000 and increments of 1,000 up to 10,000. Results for the 3K models, and some of the models that did better are here:
model size 3000 3500 4000 4500 6000 7000 8000 9000 10000
TOTAL 88.2% 88.8% 89.1% 89.5% 89.9% 90.4% 90.6% 91.2% 91.6%
Japanese 93.4% 93.6% 93.9% 94.1% 94.6% 94.9% 95.0% 95.3% 95.5%
English 96.9% 98.5% 98.5% 98.5% 93.7% 93.7% 93.7% 95.2% 95.2%
Chinese 41.2% 40.8% 41.7% 43.8% 45.7% 47.2% 47.7% 50.6% 51.8%
German 40.0% 50.0% 50.0% 50.0% 40.0% 40.0% 40.0% 40.0% 40.0%
Kazakh 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
Korean 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0% 100.0%
Russian 66.7% 66.7% 66.7% 66.7% 66.7% 66.7% 66.7% 66.7% 66.7%
Performance details for the 3K model are here (details for larger models are similar in terms of which language models perform the most poorly):
f0.5 f1 f2 recall prec total hits misses
TOTAL 88.2% 88.2% 88.2% 88.2% 88.2% 534 471 63
Japanese 97.1% 93.4% 89.9% 87.8% 99.8% 474 416 1
English 98.7% 96.9% 95.1% 93.9% 100.0% 33 31 0
Chinese 31.0% 41.2% 61.4% 91.3% 26.6% 23 21 58
German 29.4% 40.0% 62.5% 100.0% 25.0% 1 1 3
Kazakh 0.0% 0.0% 0.0% 0.0% 0.0% 1 0 0
Korean 100.0% 100.0% 100.0% 100.0% 100.0% 1 1 0
Russian 55.6% 66.7% 83.3% 100.0% 50.0% 1 1 1
Chinese does very poorly, with too many false positives. German isn’t terrible in terms of raw false positives, but isn’t great, either. (The Russian false positive is the Kazakh query, which we don’t have a model for.)
As noted above, Arabic and Hebrew are present in the larger sample, and as our models for these languages are very high accuracy, I’ve included them.
The final language set is Japanese, English, Russian, Korean, Arabic, Hebrew. As above, 3K is not the optimal model size, but it is very close once optimized by language, with most of the advantage of the larger models closed by using a better set of languages. The 3K results are shown below along with the best performing model sizes:
model size 3000 4500
TOTAL 95.1% 95.3%
Japanese 98.1% 98.1%
English 89.2% 91.7%
Chinese 0.0% 0.0%
German 0.0% 0.0%
Kazakh 0.0% 0.0%
Korean 100.0% 100.0%
Russian 66.7% 66.7%
The accuracy is very high, and the differences are very small (~0.2%), so it makes sense to stick with the default 3K models for now, but keep an eye out for significant performance improvements with other model sizes.
The detailed report for the 3K model is here:
f0.5 f1 f2 recall prec total hits misses
TOTAL 95.1% 95.1% 95.1% 95.1% 95.1% 534 508 26
Japanese 97.2% 98.1% 99.1% 99.8% 96.5% 474 473 17
English 93.2% 95.7% 98.2% 100.0% 91.7% 33 33 3
Chinese 0.0% 0.0% 0.0% 0.0% 0.0% 23 0 0
German 0.0% 0.0% 0.0% 0.0% 0.0% 1 0 0
Kazakh 0.0% 0.0% 0.0% 0.0% 0.0% 1 0 0
Korean 100.0% 100.0% 100.0% 100.0% 100.0% 1 1 0
Russian 55.6% 66.7% 83.3% 100.0% 50.0% 1 1 1
Arabic 0.0% 0.0% 0.0% 0.0% 0.0% 0 0 5
Recall went up and precision went down for Japanese and English, but overall performance improved. Most queries in the unrepresented languages were identified as either Japanese or English (decreasing precision for both), but those now unused models are no longer generating lots of false positives and bringing down precision overall. The Kazakh query showed up as Russian and the German as English (due to using the same character set). A small number of Chinese queries were tagged as Arabic, probably because they have no characters in common with any of the models, so all the models scored the same, and tied results are sorted alphabetically. (This supports the idea of adding some additional sort of confidence measure to TextCat.)
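The tie-breaking effect described above can be illustrated with a small sketch. It assumes lower scores are better and that ties are resolved by sorting on the language code, as the paragraph suggests; the score values and function name are hypothetical.

```python
def best_language(scores):
    """Return the lowest-scoring language; ties are broken alphabetically by code."""
    return min(scores.items(), key=lambda item: (item[1], item[0]))[0]

LANGUAGES = ["ja", "en", "ru", "ko", "ar", "he"]

# Hypothetical query whose characters appear in none of the models:
# every n-gram gets the maximum penalty, so every model ends up with the same score.
tied_scores = {lang: 15_000 for lang in LANGUAGES}
print(best_language(tied_scores))  # ar

# With a real difference in scores, the alphabetical order no longer matters.
print(best_language({"ja": 4_200, "en": 9_800, "ar": 15_000}))  # ja
```

A confidence measure could flag the first case, for example by declining to answer when the best and second-best scores are identical.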
jawiki: Best Options
The barely sub-optimal settings (though consistent with others using 3K models) for jawiki, based on these experiments, would be to use models for Japanese, English, Russian, Korean, Arabic, Hebrew (ja, en, ru, ko, ar, he), using the default 3000-ngram models.