
Opera Software ASA


  • 41 downloads last 7 days
  • 97,697 total downloads
  • Rating: +53
Wørd - Language Analyzer, Opera application V2.0 Oct 23, 2006 6:09:13 AM

Have you ever wondered what language a blog entry you glanced at might be in? Or are you having a hard time telling Norwegian from Danish? Wørd is a widget that figures out what language a text is in. Just paste the text snippet you're curious about into the widget and click go.

This widget is now maintained at:
What Language Is This?

Currently supported languages are:
English
German
French
Polish
Japanese
Dutch
Italian
Portuguese
Swedish
Spanish
Russian
Chinese
Finnish
Norwegian
Esperanto
Slovak
Danish
Czech
Hebrew
Catalan
Hungarian
Romanian
Indonesian
Serbian
Turkish
Slovenian
Lithuanian
Bulgarian
Ukrainian
Korean
Estonian
Croatian
Telugu
Arabic
Malay
Persian
Thai
Greek
Basque
Bengali
Icelandic
Georgian
Bosnian
Vietnamese
Cantonese


Comments 38 posts

Totally agreed with "highest rating". I can't rate technically, but the idea is perfect! :up:

By vinczej , # Nov 21, 2006 7:53:20 PM

Very cute! I tried it with some Latin, and it came up with Catalan, which is supposed to be very Latinish.

By grapefruitzzz , # Nov 13, 2006 7:11:33 AM

me27: :smile: the 20-character minimum is often too low for the widget to identify the language correctly. for reliable results it's best to at least fill the widget window with text. i'll try to improve it...

By hefa , # Nov 5, 2006 5:14:19 PM

COOL! That's a lot of languages. My only complaint is that it can be picky about sentence length.

By me27 , # Nov 4, 2006 4:18:58 PM

Here's the language list for the pre-discriminator :smile:

Start    End      Unicode Block Name
U+0000   U+007F   Basic Latin
U+0080   U+00FF   Latin-1 Supplement
U+0100   U+017F   Latin Extended-A
U+0180   U+024F   Latin Extended-B
U+0370   U+03FF   Greek
U+0400   U+04FF   Cyrillic
U+0530   U+058F   Armenian
U+0590   U+05FF   Hebrew
U+0600   U+06FF   Arabic
U+2070   U+209F   Superscripts and Subscripts
U+20A0   U+20CF   Currency Symbols
U+2190   U+21FF   Arrows
U+2200   U+22FF   Mathematical Operators
U+2440   U+245F   Optical Character Recognition
U+2500   U+257F   Box Drawing
U+2580   U+259F   Block Elements
U+25A0   U+25FF   Geometric Shapes
U+2600   U+26FF   Miscellaneous Symbols
U+2700   U+27BF   Dingbats
U+2800   U+28FF   Braille Patterns
U+3000   U+303F   CJK Symbols and Punctuation
U+3040   U+309F   Hiragana
U+30A0   U+30FF   Katakana
U+4E00   U+9FFF   CJK Unified Ideographs
U+AC00   U+D7A3   Hangul Syllables

Goodbye and thanks for all the ╠╡╢╣╤╥╦╧╨╩╪╫╬

By dantesoft , # Oct 31, 2006 9:15:35 PM

works here. win. v9.03 (Build 8629)
just the user-agent string is `Opera/9.10`

By dantesoft , # Oct 30, 2006 6:43:04 PM

Doesn't work on opera 9.10
can't paste text

By Sir_Yaro , # Oct 30, 2006 11:17:10 AM

Just to clarify what dantesoft asked: "Chinese" in Wikipedia is indeed Mandarin (中文). It's written mostly using traditional characters, with some articles in simplified characters. It seems that most articles use Taiwanese standard Mandarin, which is subtly different from mainland standard Mandarin... The Cantonese Wikipedia (粤语) is written in colloquial Cantonese (probably mostly Hong Kong Cantonese), which uses a set of characters markedly different from Mandarin.

By curwenx , # Oct 25, 2006 5:14:45 AM

thanks seifip, I'm glad you like it :smile:

By hefa , # Oct 24, 2006 11:20:51 PM

amazing widget!!! 5/5 & +fav :D

By seifip , # Oct 24, 2006 8:22:18 PM

Thanks AleksOD :D

By hefa , # Oct 24, 2006 5:33:59 PM

dantesoft, the points you make are very valid, well except for the one about separating japanese depending on which japanese script it uses; that makes no sense. :wink: And separating script from language is not really feasible, because that would require converting and correlating them, and in that case I might as well enroll in a Ph.D. program right now and spend my next ten years writing a short paper no one will bother to read while earning practically no money at all. :wink: (apologies to any Ph.D. student out there.)

Anyway, when I was testing different algorithms for this widget I came to one very strong conclusion: simple is good (bet you've heard that before). I tried many ways of weighting factors together in the algorithms, tried analyzing longer strings, and tried using larger databases in the widget.

Using larger databases for European languages written in Latin script actually made no significant difference to the certainty of the result. Trying different (and to me apparently clever) ways of weighting the frequencies of strings etc. actually gave worse results than not weighting them. Using a larger corpus might fix that though...

Secondly, I strongly believe it's best to keep all algorithms in this widget, as well as the program I use for compiling the database used in this widget, as agnostic as possible. In other words, all algorithms simply analyze characters as if they're numbers (unicode codepoints).
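To make the "characters as numbers" idea concrete, here is a minimal sketch in Python. It is not the widget's actual code, which this thread does not show. It assumes a plain character-trigram approach with unweighted matching, and the function names, the trigram length and the profile size are all hypothetical choices for illustration.

```python
from collections import Counter

def ngrams(text, n=3):
    """Return n-grams as tuples of code points; no script-specific rules."""
    cps = [ord(ch) for ch in text.lower()]
    return [tuple(cps[i:i + n]) for i in range(len(cps) - n + 1)]

def build_profile(corpus, n=3, size=300):
    """Keep the `size` most frequent n-grams of a training text as a set."""
    counts = Counter(ngrams(corpus, n))
    return {gram for gram, _ in counts.most_common(size)}

def score(text, profile, n=3):
    """Unweighted score: fraction of the text's n-grams found in the profile."""
    grams = ngrams(text, n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in profile) / len(grams)

def identify(text, profiles, n=3):
    """Return the language whose profile covers the largest share of n-grams."""
    return max(profiles, key=lambda lang: score(text, profiles[lang], n))

# Toy profiles built from two sentences quoted elsewhere in this thread.
# Real profiles would need far larger corpora.
profiles = {
    "English": build_profile("Cooking salt is a compound of sodium chloride."),
    "Croatian": build_profile("Kuhinjska sol je spoj natrija i klora."),
}
guess = identify("sodium chloride is a salt", profiles)
```

Nothing in the sketch knows what a letter, a word or a script is. The same code handles Latin, Cyrillic or Arabic input because it only compares sequences of integers.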

The method you suggest with saying "this text is mostly hiragana, so this should be Japanese" would certainly work for Japanese. And I could implement that for the languages I know... but do I wanna read up on every language there is in the world in order to implement that logic? Well actually I do want to do that. But could I possibly gain the understanding necessary to implement that logic correctly for 40+ languages I don't speak?

I don't think the results would be very good. I can't see the difference between Persian (written with Arabic script) and Arabic, but all the tests I've run with this widget, it has been able to distinguish those two languages from each other. And that's because I kept it simple.

I can consider adding certainty levels for the results though, but showing numbers doesn't make much sense, since the standard against which they are taken as relative is completely arbitrary. So I don't think I'll give up the ease of use this widget currently has for that, at least not by default (I could add a special haxx for you :smile: ).
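A small sketch of why such a number depends on an arbitrary baseline. It assumes the widget produces one raw score per candidate language, which this thread does not confirm. The function name and the scores below are invented for illustration only.

```python
def relative_certainty(scores):
    """Turn raw per-language scores into shares that sum to 1.0."""
    total = sum(scores.values())
    if total == 0:
        return {lang: 0.0 for lang in scores}
    return {lang: s / total for lang, s in scores.items()}

# Hypothetical raw scores, not measured output from the widget.
raw = {"Norwegian": 0.60, "Danish": 0.55, "English": 0.05}

all_three = relative_certainty(raw)
without_danish = relative_certainty(
    {lang: s for lang, s in raw.items() if lang != "Danish"}
)
```

With all three candidates, Norwegian gets about half of the total. Drop Danish from the candidate set and the same raw score turns into a share above 90%. The percentage says as much about which languages were compared as about the text itself.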

However, if you (or anyone else who's had the patience to read this far) find a significant text snippet (i.e. at least large enough to fill the widget's textarea) written in a decently correct form of a language this widget is supposed to support and it is misidentified, then please tell me (a link to/copy of the text, actual language, and mistaken language) and I'll investigate it and do my best to improve the identification of that language in upcoming revisions of this widget.

Moreover, if you think I should add some language, or an additional script of some language (provided that is a normal way to write that language), then please tell me, because I really want to do that, I just couldn't really find enough excuses to go on adding more languages before I got some indication anyone would actually use this widget. :wink: And of course please help me find a good source of text in that language in that case, because that's hard.

By hefa , # Oct 24, 2006 5:22:49 PM

Sweet!!! Excellent widget for doing what it is supposed to do!

By AleksOD , # Oct 24, 2006 4:55:31 PM

The discrimination I was referring to was more like this (depending on how much you want to implement, see list):
  • U+0400 – U+04FF: Cyrillic (Russian, Ukrainian, Serbian, Macedonian, Bulgarian ...)
  • U+0600 – U+06FF: Arabic (Arabics, Farsi, Jawi, Kurdish, Pashto, Sindhi, Urdu...)
  • U+0530 – U+058F: Armenian
  • U+0590 – U+05FF: Hebrew
  • ...
So if 90% of the query text is in the Hiragana range, one can assume Japanese and start processing with this bias (maybe even separating the non-JP text to identify that 10%).
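The range check described above can be sketched in a few lines of Python. This is only an illustration of the proposal, not code from the widget. The function names and the decision to count only letters are assumptions. The ranges are the ones listed in this thread, with the four Latin blocks merged into one entry.

```python
from collections import Counter

# Inclusive code point ranges, taken from the block lists in this thread.
SCRIPT_RANGES = [
    (0x0000, 0x024F, "Latin"),  # Basic Latin through Latin Extended-B
    (0x0370, 0x03FF, "Greek"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0530, 0x058F, "Armenian"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xAC00, 0xD7A3, "Hangul Syllables"),
]

def script_of(ch):
    """Return the script name for one character, or None if not listed."""
    cp = ord(ch)
    for start, end, name in SCRIPT_RANGES:
        if start <= cp <= end:
            return name
    return None

def script_shares(text):
    """Return {script: fraction}, counting only letters in a listed range."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():  # skip spaces, digits and punctuation
            continue
        name = script_of(ch)
        if name is not None:
            counts[name] += 1
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}

def dominant_script(text, threshold=0.9):
    """Return the script holding at least `threshold` of the letters, else None."""
    shares = script_shares(text)
    if not shares:
        return None
    name, share = max(shares.items(), key=lambda kv: kv[1])
    return name if share >= threshold else None
```

For example, `dominant_script("На млађима свет остаје")` returns `"Cyrillic"` and `dominant_script("Na mlađima svet ostaje")` returns `"Latin"`. A mixed snippet such as the AdSense sentence posted earlier in the thread has no script above 90%, so it returns `None` and would need the per-script split suggested above.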

Originally posted by hefa:

Yes, right now the languages are only available for one script each.
It made sense to me to have `Japanese(Katakana)` and `Japanese(Hiragana)` and `Japanese(Kanji)` and indeed `Japanese(Romaji)` as supplemental results. Isn't it just a matter of getting the right corpus ?

That would add to the current "Language:" result, a supplemental "Script:" computation.

Originally posted by hefa:

Separating Serbian written with latin characters from Croatian would be almost impossible.
Indeed, it's politics that mainly separates the languages. Plus, analyzing dialects/vocabularies is beyond the widget's scope, e.g.:
  • English: Cooking salt is a compound of sodium chloride.
  • Bosnian: Kuhinjska so je spoj natrija i hlora.
  • Croatian: Kuhinjska sol je spoj natrija i klora.
  • Serbian: Kuhinjska so je jedinjenje natrijuma i hlora.

One more issue: the code assumes standard languages. Can you consider adding a certainty level (something like "96.43% English", to account for the South African variants/Internet jargon or maybe even "80.43% Japanese [Kanji], 10.22% Japanese[Hiragana]") or would that needlessly complicate things ?

By dantesoft , # Oct 24, 2006 12:23:45 PM

Thank you for your feedback, dantesoft.

I do do simple discrimination in advance based on the frequency of characters used. But the problem is that CJK has a tremendously lower frequency of use even for the most frequently used characters simply because there are so darn many characters in those languages. But I will try to figure out some way of weighing in this factor.

Yes, right now the languages are only available for one script each. This is for two reasons: 1. that's how the wikipedia for that language is written (I got the corpus I used for analyzing the languages from wikipedia entries). 2. where does it end? romanized japanese is japanese? japanese written with hangul? of course, the line can be drawn differently. But from what I understand, separating Serbian written with latin characters from Croatian would be almost impossible. Please correct me if that is wrong. Given a good enough corpus, I'm willing to add any language to this widget. :D

Well, "Chinese" is whatever the "Chinese" edition of wikipedia is written in. :smile: But that is Mandarin, afaik. :smile:

By hefa , # Oct 24, 2006 9:49:12 AM

I think you could do a simple discriminator in advance, based on the Unicode ranges. I mean, it's `obvious` it's some unknown language (short text in Latin alphabet) and Japanese.

I installed only Japanese fonts, it's pretty clear to me when the language is Chinese (white boxes appear :smile:) and Korean sure looks different.

Speaking of script, it seems to process Farsi only in Arabic script. Also, I like how it has Bosnian, Croatian, Serbian:

Originally posted by "The world is left to the young":

"На млађима свет остаје" is Serbian but "Na mlađima svet ostaje" is Bosnian or Croatian.

But I guess that's just politics stepping over the land of mathematics :D

BTW: Is "Chinese" (standard) Mandarin ? I see you list "Cantonese"...

By dantesoft , # Oct 24, 2006 7:43:07 AM

dantesoft: Japanese, Chinese, or Korean with mixed latin characters makes the widget go a bit bananas as you noticed... it's because the CJK languages use so darn many characters... have to work on it. :S AdSenseではヴィジェットの画面に写れるかなぁ…営業手法新発明!

By hefa , # Oct 24, 2006 3:43:20 AM

Feeleeng loocky? Inter a seerch und try it oooot! :up: Bunoos fur dueeng it ell lucelly. Bork bork bork!

PS:
AdSense では、手間も費用もかけずに広告収入を増やすことができます。
Dutch, naturally :smile:

By dantesoft , # Oct 23, 2006 7:01:21 PM
