DOSAYGO Studio

Language Detection: Mapping Linguistic Patterns

By Cris, February 1, 2013

I needed to detect the language of a web page in order to classify it for archiving and scraping. There was no reliable library for doing this at the time, so I wondered if I could achieve this task using something I loved and intuitively felt might apply: the Lempel-Ziv compression algorithm or, more accurately in this case, the tokens produced by "LZ-factorization" of a text.

The output of this process would be a series of short tokens that repeat within a text. I reasoned that, for a given human language, a subset of these would converge on a kind of "fingerprint" of the language, because languages tend to repeat such morphemes to convey meaning, and do so in distinctive ways. I also felt that this would be a good way to distinguish one language from another, because the specific morpheme fingerprint would differ between languages: French would not have the same distribution of morphemes as English, for example.
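To make the idea of LZ tokens concrete, here is a minimal sketch of one way to factorize a text. The post does not say which LZ variant was used, so this assumes LZ78-style parsing, where each token is the shortest prefix of the remaining text that has not yet appeared as a token. The function name `lz78_factors` is hypothetical and chosen only for illustration.

```python
def lz78_factors(text):
    """Split text into LZ78-style phrases (an assumed variant, see above).

    Each phrase is the shortest prefix of the remaining text that has not
    been emitted as a phrase before, so longer phrases only appear once
    their shorter prefixes have already been seen.
    """
    seen = set()
    factors = []
    phrase = ""
    for ch in text:
        phrase += ch
        if phrase not in seen:
            seen.add(phrase)
            factors.append(phrase)
            phrase = ""
    if phrase:
        # The text ended in the middle of an already-seen phrase.
        factors.append(phrase)
    return factors


print(lz78_factors("abababab"))  # ['a', 'b', 'ab', 'aba', 'b']
```

Because a phrase of two or more characters can only be emitted after its prefix has occurred earlier, the longer phrases are exactly the ones that reflect repetition in the text, which is what makes them candidates for a fingerprint.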

I manually created per-language corpora for over 100 human languages and calculated their per-language fingerprints using LZ factorization. I then scored a given text based on the similarity between its LZ factorization and the fingerprint for each language, yielding a confidence score that was accurate down to the level of a single sentence. That is, if a text (a web page, for example) switched languages mid-paragraph, I could detect where one language ended and another began with this method.
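The pipeline described above can be sketched roughly as follows. This is not the original implementation: it assumes LZ78-style parsing, treats a fingerprint as the set of phrases of at least two characters, and uses a simple overlap fraction as the similarity score. The names `fingerprint`, `score`, `detect` and `detect_per_sentence`, the `min_len` parameter, and the regex-based sentence splitter are all assumptions made for illustration.

```python
import re


def lz78_factors(text):
    """LZ78-style parsing: shortest not-yet-seen prefix at each step."""
    seen, factors, phrase = set(), [], ""
    for ch in text:
        phrase += ch
        if phrase not in seen:
            seen.add(phrase)
            factors.append(phrase)
            phrase = ""
    if phrase:
        factors.append(phrase)
    return factors


def fingerprint(corpus, min_len=2):
    """Assumed fingerprint: the set of phrases with at least min_len chars."""
    return {f for f in lz78_factors(corpus.lower()) if len(f) >= min_len}


def score(text, fp, min_len=2):
    """Fraction of the text's phrases (>= min_len chars) found in fp."""
    factors = [f for f in lz78_factors(text.lower()) if len(f) >= min_len]
    if not factors:
        return 0.0
    return sum(f in fp for f in factors) / len(factors)


def detect(text, fingerprints):
    """Return (language, score) for the best-matching fingerprint."""
    scored = ((lang, score(text, fp)) for lang, fp in fingerprints.items())
    return max(scored, key=lambda pair: pair[1])


def detect_per_sentence(text, fingerprints):
    """Score each sentence separately to find where the language switches."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [(s, *detect(s, fingerprints)) for s in sentences]


if __name__ == "__main__":
    # Toy corpora for illustration only; real fingerprints need far more text.
    fingerprints = {
        "en": fingerprint("the cat sat on the mat and the hat is on the cat"),
        "fr": fingerprint("le chat est sur le tapis et le chapeau est sur le chat"),
    }
    for row in detect_per_sentence("The cat is on the mat. Le chat est sur le tapis.", fingerprints):
        print(row)
```

With corpora this small the scores are noisy; the sketch is only meant to show the shape of the computation: one fingerprint per language, one score per language for each sentence, and the highest score wins.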