Wørd is a widget that figures out what language a text is in.
Have you ever wondered what language a blog entry you glanced at might be in? Or are you having a hard time telling Norwegian from Danish? Wørd is a widget that figures out what language a text is in. Just paste the text snippet you're curious about into the widget and click go.
Currently supported languages are: English, German, French, Polish, Japanese, Dutch, Italian, Portuguese, Swedish, Spanish, Russian, Chinese, Finnish, Norwegian, Esperanto, Slovak, Danish, Czech, Hebrew, Catalan, Hungarian, Romanian, Indonesian, Serbian, Turkish, Slovenian, Lithuanian, Bulgarian, Ukrainian, Korean, Estonian, Croatian, Telugu, Arabic, Malay, Persian, Thai, Greek, Basque, Bengali, Icelandic, Georgian, Bosnian, Vietnamese, Cantonese.
me27: the 20-character minimum is often too low for the widget to identify the language correctly. For reliable results it's best to at least fill the widget window with text. I'll try to improve it...
Here's the language list for the pre-discriminator:
<table cellspacing=1 border=0>
<tr align=left><th>Start</th><th>End</th><th>Unicode Block Name</th></tr>
<tr bgcolor="#eeeeee"><td>U+0000</td><td>U+007F</td><td>Basic Latin</td></tr>
<tr><td>U+0080</td><td>U+00FF</td><td>Latin-1 Supplement</td></tr>
<tr bgcolor="#eeeeee"><td>U+0100</td><td>U+017F</td><td>Latin Extended-A</td></tr>
<tr><td>U+0180</td><td>U+024F</td><td>Latin Extended-B</td></tr>
<tr bgcolor="#eeeeee"><td>U+0370</td><td>U+03FF</td><td>Greek</td></tr>
<tr><td>U+0400</td><td>U+04FF</td><td>Cyrillic</td></tr>
<tr bgcolor="#eeeeee"><td>U+0530</td><td>U+058F</td><td>Armenian</td></tr>
<tr><td>U+0590</td><td>U+05FF</td><td>Hebrew</td></tr>
<tr bgcolor="#eeeeee"><td>U+0600</td><td>U+06FF</td><td>Arabic</td></tr>
<tr><td>U+2070</td><td>U+209F</td><td>Superscripts and Subscripts</td></tr>
<tr bgcolor="#eeeeee"><td>U+20A0</td><td>U+20CF</td><td>Currency Symbols</td></tr>
<tr><td>U+2190</td><td>U+21FF</td><td>Arrows</td></tr>
<tr bgcolor="#eeeeee"><td>U+2200</td><td>U+22FF</td><td>Mathematical Operators</td></tr>
<tr><td>U+2440</td><td>U+245F</td><td>Optical Character Recognition</td></tr>
<tr bgcolor="#eeeeee"><td>U+2500</td><td>U+257F</td><td>Box Drawing</td></tr>
<tr><td>U+2580</td><td>U+259F</td><td>Block Elements</td></tr>
<tr bgcolor="#eeeeee"><td>U+25A0</td><td>U+25FF</td><td>Geometric Shapes</td></tr>
<tr><td>U+2600</td><td>U+26FF</td><td>Miscellaneous Symbols</td></tr>
<tr bgcolor="#eeeeee"><td>U+2700</td><td>U+27BF</td><td>Dingbats</td></tr>
<tr><td>U+2800</td><td>U+28FF</td><td>Braille Patterns</td></tr>
<tr bgcolor="#eeeeee"><td>U+3000</td><td>U+303F</td><td>CJK Symbols and Punctuation</td></tr>
<tr><td>U+3040</td><td>U+309F</td><td>Hiragana</td></tr>
<tr bgcolor="#eeeeee"><td>U+30A0</td><td>U+30FF</td><td>Katakana</td></tr>
<tr><td>U+4E00</td><td>U+9FFF</td><td>CJK Unified Ideographs</td></tr>
<tr bgcolor="#eeeeee"><td>U+AC00</td><td>U+D7A3</td><td>Hangul Syllables</td></tr>
</table>
Goodbye and thanks for all the ╠╡╢╣╤╥╦╧╨╩╪╫╬
Just to clarify what dantesoft asked: "Chinese" in Wikipedia is indeed Mandarin (中文). It's written mostly in traditional characters, with some articles in simplified characters. It seems that most articles use Taiwanese standard Mandarin, which is subtly different from mainland standard Mandarin... The Cantonese Wikipedia (粤语) is written in colloquial Cantonese (probably mostly Hong Kong Cantonese), which uses a set of characters markedly different from Mandarin.
dantesoft, the points you make are very valid, well except for the one about separating japanese depending on which japanese script it uses; that makes no sense. And separating script from language is not really feasible, because that would require converting and correlating them, and in that case I might as well enroll in a Ph.D. program right now and spend my next ten years writing a short paper no one will bother to read while earning practically no money at all. (apologies to any Ph.D. student out there.)
Anyway, when I was testing different algorithms for this widget I came to one very strong conclusion: simple is good (bet you've heard that before). I tried many forms of weighting factors together in the algorithms, tried analyzing longer strings, and tried using larger databases in the widget.
Using larger databases on European languages written with Latin script actually made no significant difference at all to the certainty of the result. Trying different (and to me apparently clever) forms of weighting the frequencies of strings etc. actually gave worse results than not weighting them. Using a larger corpus might fix that though...
Secondly, I strongly believe it's best to keep all algorithms in this widget, as well as the program I use for compiling the database used in this widget, as agnostic as possible. In other words, all algorithms simply analyze characters as if they're numbers (Unicode code points).
The method you suggest with saying "this text is mostly hiragana, so this should be Japanese" would certainly work for Japanese. And I could implement that for the languages I know... but do I wanna read up on every language there is in the world in order to implement that logic? Well actually I do want to do that. But could I possibly gain the understanding necessary to implement that logic correctly for 40+ languages I don't speak?
I don't think the results would be very good. I can't see the difference between Persian (written with Arabic script) and Arabic, but in all the tests I've run, this widget has been able to distinguish those two languages from each other. And that's because I kept it simple.
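The thread never spells out the widget's actual algorithm, so the following is only a minimal Python sketch of one common script-agnostic technique: comparing character trigram counts by cosine similarity. The function names, the toy corpora, and the choice of trigrams are all assumptions made for illustration; nothing in it knows anything about a particular language or script.

```python
import math
from collections import Counter


def ngram_profile(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams; characters are treated as opaque code points."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(text: str, profiles: dict) -> str:
    """Return the language whose stored profile is most similar to the text."""
    query = ngram_profile(text)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Hypothetical toy corpora; a real database would be built from far more text.
CORPORA = {
    "English": "the quick brown fox jumps over the lazy dog and then the fox runs away",
    "German": "der schnelle braune fuchs springt über den faulen hund und läuft dann weg",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in CORPORA.items()}

print(identify("the fox and the dog", PROFILES))
print(identify("der hund und der fuchs", PROFILES))
```

With corpora this small the result is fragile, which fits the advice above to paste at least a textarea's worth of text.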
I can consider adding certainty levels for the results though, but showing numbers doesn't make much sense, since the standard against which they are taken as relative is completely arbitrary. So I don't think I'll give up the ease of use this widget currently has for that, at least not by default (I could add a special haxx for you).
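To make the point about arbitrary reference standards concrete, here is a small Python sketch. The raw scores and the `as_percentages` helper are hypothetical and not taken from the widget; the sketch only shows that a "percentage" obtained by normalizing scores depends on which candidate languages happen to be in the comparison.

```python
def as_percentages(scores: dict) -> dict:
    """Turn raw per-language scores into shares of the total, in percent."""
    total = sum(scores.values())
    if total == 0:
        return {lang: 0.0 for lang in scores}
    return {lang: round(100 * score / total, 2) for lang, score in scores.items()}


# Hypothetical raw scores for one and the same input text.
two_candidates = {"English": 8.0, "Dutch": 2.0}
three_candidates = {"English": 8.0, "Dutch": 2.0, "German": 6.0}

print(as_percentages(two_candidates))
print(as_percentages(three_candidates))
```

The English score is identical in both calls, yet its share drops once German joins the candidate list, so the number says more about the candidate set than about the text.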
However, if you (or anyone else who's had the patience to read this far) find a significant text snippet (i.e. at least big enough to fill the widget's textarea) written in a decently correct form of a language this widget is supposed to support and it is misidentified, then please tell me (a link to/copy of the text, actual language, and mistaken language) and I'll investigate it and do my best to improve the identification of that language in upcoming revisions of this widget.
Moreover, if you think I should add some language, or an additional script of some language (provided that is a normal way to write that language), then please tell me, because I really want to do that. I just couldn't find enough excuses to go on adding more languages before I got some indication anyone would actually use this widget. And in that case, please help me find a good source of text in that language, because that's hard.
So if 90% of the query text is in the Hiragana range, one can assume Japanese and start processing with this bias (maybe even separating the non-JP text to identify that 10%).
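The range-based bias suggested here can be sketched roughly as follows in Python. The ranges are taken from the Unicode block table posted in this thread; merging the four Latin blocks into one "Latin" label, ignoring whitespace, and the names `block_shares` and `dominant_block` are assumptions for illustration only.

```python
from collections import Counter

# (start, end, label) ranges from the Unicode block table in this thread.
# Assumption: the four contiguous Latin blocks are merged into one label.
BLOCKS = [
    (0x0000, 0x024F, "Latin"),
    (0x0370, 0x03FF, "Greek"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0530, 0x058F, "Armenian"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xAC00, 0xD7A3, "Hangul Syllables"),
]


def block_of(ch: str) -> str:
    """Map a single character to its block label, or 'Other' if unlisted."""
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return "Other"


def block_shares(text: str) -> dict:
    """Fraction of non-whitespace characters that falls into each block."""
    counts = Counter(block_of(ch) for ch in text if not ch.isspace())
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}


def dominant_block(text: str, threshold: float = 0.9):
    """Return the block holding at least `threshold` of the text, else None."""
    shares = block_shares(text)
    if not shares:
        return None
    name, share = max(shares.items(), key=lambda item: item[1])
    return name if share >= threshold else None


print(dominant_block("На млађима свет остаје"))   # Cyrillic
print(dominant_block("Na mlađima svet ostaje"))   # Latin
print(dominant_block("AdSenseではヴィジェット"))     # None (mixed scripts)
```

A dominant block narrows the candidates but does not name a language on its own: Cyrillic still leaves Russian, Serbian, Bulgarian and others, so a frequency-based step would have to follow.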
Originally posted by hefa:
Yes, right now the languages are only available for one script each.
It made sense to me to have `Japanese(Katakana)` and `Japanese(Hiragana)` and `Japanese(Kanji)` and indeed `Japanese(Romaji)` as supplemental results. Isn't it just a matter of getting the right corpus ?
That would add to the current "Language:" result, a supplemental "Script:" computation.
Originally posted by hefa:
Separating Serbian written with latin characters from Croatian would be almost impossible.
Indeed, it's politics that mainly separates the languages. Plus, analyzing dialects/vocabularies is beyond the widget's scope, e.g.:
English: Cooking salt is a compound of sodium and chlorine.
Bosnian: Kuhinjska so je spoj natrija i hlora.
Croatian: Kuhinjska sol je spoj natrija i klora.
Serbian: Kuhinjska so je jedinjenje natrijuma i hlora.
One more issue: the code assumes standard languages. Can you consider adding a certainty level (something like "96.43% English", to account for the South African variants/Internet jargon, or maybe even "80.43% Japanese [Kanji], 10.22% Japanese [Hiragana]"), or would that needlessly complicate things?
I do do simple discrimination in advance based on the frequency of characters used. But the problem is that CJK has a tremendously lower frequency of use even for the most frequently used characters simply because there are so darn many characters in those languages. But I will try to figure out some way of weighing in this factor.
Yes, right now the languages are only available for one script each. This is for two reasons: 1. That's how the Wikipedia for that language is written (I got the corpus I used for analyzing the languages from Wikipedia entries). 2. Where does it end? Is romanized Japanese Japanese? Japanese written with Hangul? Of course, the line can be drawn differently. But from what I understand, separating Serbian written with Latin characters from Croatian would be almost impossible. Please correct me if that is wrong. Given a good enough corpus, I'm willing to add any language to this widget.
Well, "Chinese" is whatever the "Chinese" edition of wikipedia is written in. But that is Mandarin, afaik.
I think you could do a simple discriminator in advance, based on the Unicode ranges. I mean, it's `obvious` it's some unknown language (short text in Latin alphabet) and Japanese.
I have only Japanese fonts installed, so it's pretty clear to me when the language is Chinese (white boxes appear), and Korean sure looks different.
Speaking of script, it seems to process Farsi only in Arabic script. Also, I like how it has Bosnian, Croatian, Serbian:
Originally posted by "The world is left to the young":
"На млађима свет остаје" is Serbian but "Na mlađima svet ostaje" is Bosnian or Croatian.
But I guess that's just politics stepping over the land of mathematics
BTW: Is "Chinese" (standard) Mandarin ? I see you list "Cantonese"...
dantesoft: Japanese, Chinese, or Korean with mixed latin characters makes the widget go a bit bananas as you noticed... it's because the CJK languages use so darn many characters... have to work on it. :S AdSenseではヴィジェットの画面に写れるかなぁ…営業手法新発明!
By vinczej , # Nov 21, 2006 7:53:20 PM
By grapefruitzzz , # Nov 13, 2006 7:11:33 AM
By hefa , # Nov 5, 2006 5:14:19 PM
By me27 , # Nov 4, 2006 4:18:58 PM
By dantesoft , # Oct 31, 2006 9:15:35 PM
just the user-agent string is `Opera/9.10`
By dantesoft , # Oct 30, 2006 6:43:04 PM
can't paste text
By Sir_Yaro , # Oct 30, 2006 11:17:10 AM
By curwenx , # Oct 25, 2006 5:14:45 AM
By hefa , # Oct 24, 2006 11:20:51 PM
By seifip , # Oct 24, 2006 8:22:18 PM
By hefa , # Oct 24, 2006 5:33:59 PM
By hefa , # Oct 24, 2006 5:22:49 PM
By AleksOD , # Oct 24, 2006 4:55:31 PM
- U+0400 – U+04FF: Cyrillic (Russian, Ukrainian, Serbian, Macedonian, Bulgarian...)
- U+0600 – U+06FF: Arabic (Arabic, Farsi, Jawi, Kurdish, Pashto, Sindhi, Urdu...)
- U+0530 – U+058F: Armenian
- U+0590 – U+05FF: Hebrew
- ...
By dantesoft , # Oct 24, 2006 12:23:45 PM
By hefa , # Oct 24, 2006 9:49:12 AM
By dantesoft , # Oct 24, 2006 7:43:07 AM
By hefa , # Oct 24, 2006 3:43:20 AM
PS: Dutch, naturally
By dantesoft , # Oct 23, 2006 7:01:21 PM