Tika Language Detection

by Online Tutorials Library

Tika can identify the language of any document or piece of text. This is useful when extracting text from document formats that do not include language information in their metadata.

Tika uses the LanguageProfile and LanguageIdentifier classes to match text to an ISO 639 language code.
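As a rough illustration of what profile-based matching means, the sketch below builds character trigram profiles and picks the reference profile closest to the input. It is not Tika's implementation: the class NgramLanguageSketch, its methods, and the two tiny sample sentences are hypothetical, and real language profiles are built from far larger corpora.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical, standard-library-only sketch of n-gram profile matching.
public class NgramLanguageSketch {

    // Tiny illustrative reference texts (assumptions, not Tika's data).
    static final String EN_SAMPLE =
        "the quick brown fox jumps over the lazy dog and this is the house";
    static final String DE_SAMPLE =
        "der schnelle braune fuchs springt und das ist nicht das haus von dem hund";

    // Count overlapping character trigrams in the lower-cased text.
    static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Euclidean distance between the relative trigram frequencies of two profiles.
    static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        double totalA = Math.max(1, a.values().stream().mapToInt(Integer::intValue).sum());
        double totalB = Math.max(1, b.values().stream().mapToInt(Integer::intValue).sum());
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        double sum = 0.0;
        for (String k : keys) {
            double d = a.getOrDefault(k, 0) / totalA - b.getOrDefault(k, 0) / totalB;
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Return the language code whose reference profile is closest to the text.
    static String detect(String text, Map<String, Map<String, Integer>> references) {
        Map<String, Integer> p = profile(text);
        String best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<String, Map<String, Integer>> e : references.entrySet()) {
            double d = distance(p, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    // Reference profiles keyed by ISO 639-1 code.
    static Map<String, Map<String, Integer>> sampleProfiles() {
        Map<String, Map<String, Integer>> refs = new LinkedHashMap<>();
        refs.put("en", profile(EN_SAMPLE));
        refs.put("de", profile(DE_SAMPLE));
        return refs;
    }

    public static void main(String[] args) {
        String code = detect("this is the house of the dog", sampleProfiles());
        System.out.println("Closest profile: " + code);
    }
}
```

With reference texts this small the result is only indicative; detection becomes more reliable as the input and the reference profiles grow.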

Tika can detect 18 of the 184 currently registered ISO 639-1 languages.

ISO 639 is a set of standards defined by the International Organization for Standardization (ISO).

Tika is able to detect various languages, including English, German, and Italian. See the following table.

Code  Language
da    Danish
de    German
et    Estonian
el    Greek
en    English
es    Spanish
fi    Finnish
fr    French
hu    Hungarian
is    Icelandic
it    Italian
nl    Dutch
no    Norwegian
pl    Polish
pt    Portuguese
ru    Russian
sv    Swedish
th    Thai

Language Detection in Tika

The following image shows the key components of the language detection process.

[Image: Tika Language Detection]

The org.apache.tika.language package contains all the classes required to detect the language of a document or piece of text. Let's see an example.

Tika Language Detection Example
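A minimal sketch of such an example, assuming a Tika 1.x jar on the classpath (the LanguageIdentifier class was deprecated in later releases in favour of a separate language detection module). The class name LanguageDetection and the sample sentence are illustrative assumptions.

```java
import org.apache.tika.language.LanguageIdentifier;

public class LanguageDetection {

    public static void main(String[] args) {
        // Build an identifier directly from the text to classify.
        LanguageIdentifier identifier = new LanguageIdentifier("this is english ");

        // getLanguage() returns the language code of the best-matching profile.
        String language = identifier.getLanguage();
        System.out.println("Language code is : " + language);
    }
}
```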

Output:

Language code is : en  
