  • 207 downloads last 7 days
  • 88,941 total downloads
  • Rating: +52
Wørd - Language Analyzer V2.0 Oct 23, 2006 6:09:13 AM

Wørd is a widget that figures out what language a text is in.

Have you ever wondered what language a blog entry you glanced at might be in? Or are you having a hard time telling Norwegian from Danish? Just paste the text snippet you're curious about into the widget and click go.

This widget is now maintained at:
What Language Is This?

Currently supported languages are:
English
German
French
Polish
Japanese
Dutch
Italian
Portuguese
Swedish
Spanish
Russian
Chinese
Finnish
Norwegian
Esperanto
Slovak
Danish
Czech
Hebrew
Catalan
Hungarian
Romanian
Indonesian
Serbian
Turkish
Slovenian
Lithuanian
Bulgarian
Ukrainian
Korean
Estonian
Croatian
Telugu
Arabic
Malay
Persian
Thai
Greek
Basque
Bengali
Icelandic
Georgian
Bosnian
Vietnamese
Cantonese


Comments 37 posts

Very cute! I tried it with some Latin, and it came up with Catalan, which is supposed to be very Latinish.

By grapefruitzzz , # Nov 13, 2006 7:11:33 AM

me27: :smile: The 20-character minimum is often too low for the widget to identify the language correctly. For reliable results it's best to at least fill the widget window with text. I'll try to improve it...

By hefa , # Nov 5, 2006 5:14:19 PM

COOL! That's a lot of languages. My only complaint is that it can be picky about sentence length.

By me27 , # Nov 4, 2006 4:18:58 PM

Here's the language list for the pre-discriminator :smile:

Start    End      Unicode Block Name
U+0000   U+007F   Basic Latin
U+0080   U+00FF   Latin-1 Supplement
U+0100   U+017F   Latin Extended-A
U+0180   U+024F   Latin Extended-B
U+0370   U+03FF   Greek
U+0400   U+04FF   Cyrillic
U+0530   U+058F   Armenian
U+0590   U+05FF   Hebrew
U+0600   U+06FF   Arabic
U+2070   U+209F   Superscripts and Subscripts
U+20A0   U+20CF   Currency Symbols
U+2190   U+21FF   Arrows
U+2200   U+22FF   Mathematical Operators
U+2440   U+245F   Optical Character Recognition
U+2500   U+257F   ...

By dantesoft , # Oct 31, 2006 9:15:35 PM

works here. win. v9.03 (Build 8629)
just the user-agent string is `Opera/9.10`

By dantesoft , # Oct 30, 2006 6:43:04 PM

Doesn't work on opera 9.10
can't paste text

By Sir_Yaro , # Oct 30, 2006 11:17:10 AM

Just to clarify what dantesoft asked: "Chinese" in Wikipedia is indeed Mandarin (中文). It's written mostly using traditional characters, with some articles in simplified characters. It seems that most articles use Taiwanese standard Mandarin, which is subtly different from mainland standard Mandarin... The Cantonese Wikipedia (粤语) is written in colloquial Cantonese (probably mostly Hong Kong Cantonese), which uses a set of characters markedly different from Mandarin.

By curwenx , # Oct 25, 2006 5:14:45 AM

thanks seifip, I'm glad you like it :smile:

By hefa , # Oct 24, 2006 11:20:51 PM

amazing widget!!! 5/5 & +fav :D

By seifip , # Oct 24, 2006 8:22:18 PM

Thanks AleksOD :D

By hefa , # Oct 24, 2006 5:33:59 PM

dantesoft, the points you make are very valid, well, except for the one about separating Japanese depending on which Japanese script it uses; that makes no sense. :wink: And separating script from language is not really feasible, because that would require converting and correlating them, and in that case I might as well enroll in a Ph.D. program right now and spend my next ten years writing a short paper no one will bother to read while earning practically no money at all. :wink: (Apologies to any Ph.D. student out there.)

Anyway, when I was testing different algorithms for this widget I came to one very strong conclusion: simple is good (bet you've heard that before). I tried many forms of weighting factors together in the algorithms, tried analyzing longer strings, and tried using larger databases in the widget.

Using larger databases on European languages written with Latin script actually made no significant difference to the certainty of the result. Trying different (and to me apparently clever) forms of weighting the frequencies of strings etc. actually gave worse results than not weighting them. Using a larger corpus might fix that though...
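The widget's actual algorithm and database are not shown in this thread. As a rough illustration of what unweighted string-frequency matching could look like, here is a minimal Python sketch that assumes character trigrams and plain set overlap. The names `top_ngrams` and `guess_language`, the cut-off `k`, and the two toy corpora are all hypothetical and far too small for real use.

```python
from collections import Counter


def top_ngrams(text, n=3, k=300):
    """Return the set of the k most frequent character n-grams in text."""
    # Normalise case and whitespace so that line breaks do not create new n-grams.
    text = " ".join(text.lower().split())
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {gram for gram, _ in grams.most_common(k)}


def guess_language(sample, profiles, n=3):
    """Pick the language whose profile shares the most n-grams with the sample.

    Every shared n-gram counts exactly once: no weighting by frequency.
    """
    sample_grams = top_ngrams(sample, n=n)
    scores = {lang: len(sample_grams & grams) for lang, grams in profiles.items()}
    return max(scores, key=scores.get), scores


# Toy corpora (assumption): a real profile would be built from a large corpus.
corpora = {
    "English": "the quick brown fox jumps over the lazy dog "
               "and then the dog sleeps in the sun",
    "German": "der schnelle braune fuchs springt über den faulen hund "
              "und dann schläft der hund in der sonne",
}
profiles = {lang: top_ngrams(text) for lang, text in corpora.items()}

best, scores = guess_language("the dog and the fox", profiles)
print(best, scores)
```

A weighted variant would multiply each shared n-gram by its corpus frequency instead of counting it once; the sketch leaves that out to match the "not weighting" result described above.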

Secondly, I strongly believe it's best

By hefa , # Oct 24, 2006 5:22:49 PM

Sweet!!! Excellent widget for doing what it is supposed to do!

By AleksOD , # Oct 24, 2006 4:55:31 PM

The discrimination I was referring to was more like this (depending on how much you want to implement, see list):
  • U+0400 – U+04FF: Cyrillic (Russian, Ukrainian, Serbian, Macedonian, Bulgarian ...)
  • U+0600 – U+06FF: Arabic (Arabics, Farsi, Jawi, Kurdish, Pashto, Sindhi, Urdu...)
  • U+0530 – U+058F: Armenian
  • U+0590 – U+05FF: Hebrew
  • ...
So if 90% of the query text is in the Hiragana range, one can assume Japanese and start processing with this bias (maybe even separating the non-JP text to identify that 10%).
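The range check described above can be sketched in Python. This is a minimal sketch, not the widget's code: the function names `block_of`, `script_shares` and `dominant_script` are hypothetical, the 0.9 default comes from the 90% figure in the comment, and the Hiragana, Katakana, CJK, Hangul and Latin rows are standard Unicode block ranges added here as an assumption to complete the example.

```python
from collections import Counter

# (start, end, script name). The first four rows are the ranges listed above;
# the remaining rows are assumptions added for the example.
BLOCKS = [
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0530, 0x058F, "Armenian"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xAC00, 0xD7AF, "Hangul Syllables"),
    (0x0000, 0x024F, "Latin"),  # Basic Latin through Latin Extended-B
]


def block_of(ch):
    """Return the script name for one character, or "Other" if unlisted."""
    code_point = ord(ch)
    for start, end, name in BLOCKS:
        if start <= code_point <= end:
            return name
    return "Other"


def script_shares(text):
    """Return {script: fraction of letters}, ignoring digits, spaces and punctuation."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return {}
    counts = Counter(block_of(ch) for ch in letters)
    return {name: count / len(letters) for name, count in counts.items()}


def dominant_script(text, threshold=0.9):
    """Return the script covering at least `threshold` of the letters, else None."""
    shares = script_shares(text)
    if not shares:
        return None
    name, share = max(shares.items(), key=lambda item: item[1])
    return name if share >= threshold else None


print(dominant_script("На млађима свет остаје"))
print(dominant_script("Na mlađima svet ostaje"))
```

The result could then narrow the candidate list before any frequency analysis runs, for example restricting a Cyrillic text to Russian, Ukrainian, Serbian and Bulgarian. Returning `None` for mixed text leaves room for the idea of splitting off the minority script and analysing it separately.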

Originally posted by hefa:

Yes, right now the languages are only available for one script each.
It made sense to me to have `Japanese(Katakana)` and `Japanese(Hiragana)` and `Japanese(Kanji)` and indeed `Japanese(Romaji)` as supplemental results. Isn't it just a matter of getting the right corpus ?

That would add to the current "Language:" result, a supplemental "Script:" computation.

Originally posted by hefa:

Separating Serbian written with latin characters from Croatian would be almost impossible.
Indeed, it's politics that mainly separates the languages. Plus, analyzing dialects/vocabularies is beyond the widget's scope.

By dantesoft , # Oct 24, 2006 12:23:45 PM

Thank you for your feedback, dantesoft.

I do do simple discrimination in advance based on the frequency of characters used. But the problem is that CJK has a tremendously lower frequency of use even for the most frequently used characters simply because there are so darn many characters in those languages. But I will try to figure out some way of weighing in this factor.
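To make the frequency problem concrete, here is a small Python sketch (an illustration only, not the widget's code; `top_char_share` is a hypothetical name). It measures what fraction of a text's letters is covered by its `k` most frequent characters. The more distinct characters a script uses, the smaller that fraction tends to be, so a cut-off tuned on alphabetic scripts would not carry over to CJK text.

```python
from collections import Counter


def top_char_share(text, k=1):
    """Fraction of all letters accounted for by the k most frequent letters."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    counts = Counter(letters)
    top_total = sum(count for _, count in counts.most_common(k))
    return top_total / len(letters)


# A text that reuses few characters has a high share ...
print(top_char_share("aaab"))
# ... while a text where every character is different has a low one.
print(top_char_share("abcd"))
```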

Yes, right now the languages are only available for one script each. This is for two reasons: 1. That's how the Wikipedia for that language is written (I got the corpus I used for analyzing the languages from Wikipedia entries). 2. Where does it end? Is romanized Japanese Japanese? Japanese written with Hangul? Of course, the line can be drawn differently. But from what I understand, separating Serbian written with Latin characters from Croatian would be almost impossible. Please correct me if that is wrong. Given a good enough corpus, I'm willing to add any language to this widget. :D

Well, "Chinese" is whatever the "Chinese" edition of Wikipedia is written in. :smile: But that is Mandarin, afaik. :smile:

By hefa , # Oct 24, 2006 9:49:12 AM

I think you could do a simple discriminator in advance, based on the Unicode ranges. I mean, it's `obvious` it's some unknown language (short text in Latin alphabet) and Japanese.

I installed only Japanese fonts, it's pretty clear to me when the language is Chinese (white boxes appear :smile:) and Korean sure looks different.

Speaking of script, it seems to process Farsi only in Arabic script. Also, I like how it has Bosnian, Croatian, Serbian:

Originally posted by "The world is left to the young":

"На млађима свет остаје" is Serbian but "Na mlađima svet ostaje" is Bosnian or Croatian.

But I guess that's just politics stepping over the land of mathematics :D

BTW: Is "Chinese" (standard) Mandarin ? I see you list "Cantonese"...

By dantesoft , # Oct 24, 2006 7:43:07 AM

dantesoft: Japanese, Chinese, or Korean mixed with Latin characters makes the widget go a bit bananas, as you noticed... it's because the CJK languages use so darn many characters... have to work on it. :S AdSenseではヴィジェットの画面に写れるかなぁ…営業手法新発明!

By hefa , # Oct 24, 2006 3:43:20 AM

Feeleeng loocky? Inter a seerch und try it oooot! :up: Bunoos fur dueeng it ell lucelly. Bork bork bork!

PS:
AdSense では、手間も費用もかけずに広告収入を増やすことができます。
Dutch, naturally :smile:

By dantesoft , # Oct 23, 2006 7:01:21 PM

Copyright © 2001 - 2009 Opera Software. All rights reserved.