Wørd is a widget that figures out what language a text is in.
Have you ever wondered what language a blog entry you glanced at might be in? Or are you having a hard time telling Norwegian from Danish? Just paste the text snippet you're curious about into the widget and click go.
Currently supported languages are: English, German, French, Polish, Japanese, Dutch, Italian, Portuguese, Swedish, Spanish, Russian, Chinese, Finnish, Norwegian, Esperanto, Slovak, Danish, Czech, Hebrew, Catalan, Hungarian, Romanian, Indonesian, Serbian, Turkish, Slovenian, Lithuanian, Bulgarian, Ukrainian, Korean, Estonian, Croatian, Telugu, Arabic, Malay, Persian, Thai, Greek, Basque, Bengali, Icelandic, Georgian, Bosnian, Vietnamese, Cantonese.
me27: the minimum of 20 characters is often too low for the widget to identify the language correctly. For reliable results it's best to at least fill the widget window with text. I'll try to improve it...
Here's the language list for the pre-discriminator:
- U+0000 – U+007F: Basic Latin
- U+0080 – U+00FF: Latin-1 Supplement
- U+0100 – U+017F: Latin Extended-A
- U+0180 – U+024F: Latin Extended-B
- U+0370 – U+03FF: Greek
- U+0400 – U+04FF: Cyrillic
- U+0530 – U+058F: Armenian
- U+0590 – U+05FF: Hebrew
- U+0600 – U+06FF: Arabic
- U+2070 – U+209F: Superscripts and Subscripts
- U+20A0 – U+20CF: Currency Symbols
- U+2190 – U+21FF: Arrows
- U+2200 – U+22FF: Mathematical Operators
- U+2440 – U+245F: Optical Character Recognition
- U+2500 – U+257F: ...
Just to clarify what dantesoft asked: "Chinese" in Wikipedia is indeed Mandarin (中文). It's written mostly in traditional characters, with some articles in simplified characters. It seems that most articles use Taiwanese standard Mandarin, which is subtly different from mainland standard Mandarin... The Cantonese Wikipedia (粤语) is written in colloquial Cantonese (probably mostly Hong Kong Cantonese), which uses a set of characters markedly different from Mandarin.
dantesoft, the points you make are very valid, well, except for the one about separating Japanese depending on which Japanese script it uses; that makes no sense. And separating script from language is not really feasible, because that would require converting and correlating them, and in that case I might as well enroll in a Ph.D. program right now and spend my next ten years writing a short paper no one will bother to read while earning practically no money at all. (Apologies to any Ph.D. student out there.)
Anyway, when I was testing different algorithms for this widget I came to one very strong conclusion: simple is good (bet you've heard that before). I tried many forms of weighting factors together in the algorithms, tried analyzing longer strings, and tried using larger databases in the widget.
Using larger databases on European languages written with Latin script actually made no significant difference at all to the certainty of the result. Trying different (and to me apparently clever) forms of weighting frequencies of strings etc. actually gave worse results than not weighting them. Using a larger corpus might fix that though...
So if 90% of the query text is in the Hiragana range, one can assume Japanese and start processing with this bias (maybe even separating the non-JP text to identify that 10%).
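A minimal sketch of that range check, assuming the Hiragana block is U+3040 to U+309F and treating the 90% figure as an adjustable threshold. The function names are hypothetical and this is not the widget's code.

```python
# Hypothetical helpers; only the idea of a Hiragana share threshold (90%)
# comes from the comment above.
HIRAGANA = (0x3040, 0x309F)  # Unicode Hiragana block

def share_in_range(text, lo, hi):
    """Fraction of non-whitespace characters whose code point is in [lo, hi]."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(lo <= ord(c) <= hi for c in chars) / len(chars)

def split_by_range(text, lo, hi):
    """Separate characters inside the range from the rest, preserving order."""
    inside = "".join(c for c in text if lo <= ord(c) <= hi)
    outside = "".join(c for c in text if not (lo <= ord(c) <= hi))
    return inside, outside

def assume_japanese(text, threshold=0.9):
    """True when at least `threshold` of the text falls in the Hiragana block."""
    return share_in_range(text, *HIRAGANA) >= threshold
```

`split_by_range` covers the "separate the non-JP text" part: the second string it returns could be fed back into the normal detection to identify the remaining 10%.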
Originally posted by hefa:
Yes, right now the languages are only available for one script each.
It made sense to me to have `Japanese(Katakana)` and `Japanese(Hiragana)` and `Japanese(Kanji)` and indeed `Japanese(Romaji)` as supplemental results. Isn't it just a matter of getting the right corpus?
That would add to the current "Language:" result, a supplemental "Script:" computation.
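One way such a supplemental "Script:" value could be computed, as a rough sketch: Python's standard `unicodedata` module does not expose the Unicode Script property, so this takes the first word of each character's Unicode name as an approximate script label. The function name `dominant_script` is made up for illustration.

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    """Return the most common script label among the letters of text, or None.

    The label is the first word of the Unicode character name
    (e.g. 'LATIN', 'CYRILLIC', 'HIRAGANA', 'KATAKANA', 'CJK'),
    which only approximates the real Unicode Script property.
    """
    labels = Counter()
    for c in text:
        if c.isalpha():
            name = unicodedata.name(c, "")
            if name:
                labels[name.split()[0]] += 1
    return labels.most_common(1)[0][0] if labels else None
```

With the Serbian example quoted elsewhere in this thread, the Cyrillic spelling comes out as `CYRILLIC` and the Latin spelling as `LATIN`, so the script could be reported next to whatever the language detection returns.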
Originally posted by hefa:
Separating Serbian written with Latin characters from Croatian would be almost impossible.
Indeed, it's politics that mainly separates the languages. Plus, analyzing dialects/vocabularies is beyond the widget's scope.
I do do simple discrimination in advance based on the frequency of characters used. But the problem is that CJK has a tremendously lower frequency of use even for the most frequently used characters simply because there are so darn many characters in those languages. But I will try to figure out some way of weighing in this factor.
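To illustrate the effect described here, a toy sketch of unweighted character-frequency scoring. The thread does not spell out the widget's actual algorithm, so the functions and the scoring rule (mean corpus frequency of the letters in the query) are assumptions chosen for illustration.

```python
from collections import Counter

def char_profile(sample):
    """Relative frequency of each letter in a training sample."""
    letters = [c for c in sample.lower() if c.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {c: n / total for c, n in counts.items()}

def score(text, profile):
    """Mean profile frequency of the letters in text (unseen letters count as 0)."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return 0.0
    return sum(profile.get(c, 0.0) for c in letters) / len(letters)

def guess(text, profiles):
    """Pick the language whose profile gives the highest score."""
    return max(profiles, key=lambda lang: score(text, profiles[lang]))
```

A profile built from two distinct characters gives a perfect match a score of 0.5, while a profile spread evenly over eight distinct characters gives a perfect match only 0.125. With thousands of distinct characters, as in CJK text, the raw score shrinks accordingly, so some normalization per script would be needed before comparing scores across scripts.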
Yes, right now the languages are only available for one script each. This is for two reasons: 1. That's how the Wikipedia for that language is written (I got the corpus I used for analyzing the languages from Wikipedia entries). 2. Where does it end? Is romanized Japanese Japanese? Japanese written with Hangul? Of course, the line can be drawn differently. But from what I understand, separating Serbian written with Latin characters from Croatian would be almost impossible. Please correct me if that is wrong. Given a good enough corpus, I'm willing to add any language to this widget.
Well, "Chinese" is whatever the "Chinese" edition of Wikipedia is written in. But that is Mandarin, afaik.
I think you could do a simple discriminator in advance, based on the Unicode ranges. I mean, it's `obvious` it's some unknown language (short text in Latin alphabet) and Japanese.
I installed only Japanese fonts, so it's pretty clear to me when the language is Chinese (white boxes appear), and Korean sure looks different.
Speaking of script, it seems to process Farsi only in Arabic script. Also, I like how it has Bosnian, Croatian, Serbian:
Originally posted by "The world is left to the young":
"На млађима свет остаје" is Serbian but "Na mlađima svet ostaje" is Bosnian or Croatian.
But I guess that's just politics stepping over the land of mathematics.
BTW: Is "Chinese" (standard) Mandarin? I see you list "Cantonese"...
dantesoft: Japanese, Chinese, or Korean with mixed Latin characters makes the widget go a bit bananas, as you noticed... it's because the CJK languages use so darn many characters... have to work on it. :S
AdSenseではヴィジェットの画面に写れるかなぁ…営業手法新発明!
By grapefruitzzz , # Nov 13, 2006 7:11:33 AM
By hefa , # Nov 5, 2006 5:14:19 PM
By me27 , # Nov 4, 2006 4:18:58 PM
By dantesoft , # Oct 31, 2006 9:15:35 PM
just the user-agent string is `Opera/9.10`
By dantesoft , # Oct 30, 2006 6:43:04 PM
can't paste text
By Sir_Yaro , # Oct 30, 2006 11:17:10 AM
By curwenx , # Oct 25, 2006 5:14:45 AM
By hefa , # Oct 24, 2006 11:20:51 PM
By seifip , # Oct 24, 2006 8:22:18 PM
By hefa , # Oct 24, 2006 5:33:59 PM
Secondly, I strongly believe it's best
By hefa , # Oct 24, 2006 5:22:49 PM
By AleksOD , # Oct 24, 2006 4:55:31 PM
- U+0400 – U+04FF: Cyrillic (Russian, Ukrainian, Serbian, Macedonian, Bulgarian ...)
- U+0600 – U+06FF: Arabic (Arabic, Farsi, Jawi, Kurdish, Pashto, Sindhi, Urdu...)
- U+0530 – U+058F: Armenian
- U+0590 – U+05FF: Hebrew
- ...
By dantesoft , # Oct 24, 2006 12:23:45 PM
By hefa , # Oct 24, 2006 9:49:12 AM
By dantesoft , # Oct 24, 2006 7:43:07 AM
By hefa , # Oct 24, 2006 3:43:20 AM
PS: Dutch, naturally
By dantesoft , # Oct 23, 2006 7:01:21 PM