Pre-trained Python model for easy language identification

Available options

Technically, you have three options when it comes to identifying the language of a given text.

  1. You can manually label the data, but this can easily get out of hand as the volume increases.
  2. You can train your own classifier, but this means a long process of vectorizing the data and tuning the model (not to mention that you need a labeled data set if you are working with a supervised method).
  3. You can use an open-source pre-trained model.

Solution

I will focus on langid throughout this article, though you should also check out Facebook’s fastText.

Step 1

Install langid using your favorite package management system.

% pip install langid

Step 2

Open up a Python file and do the following:

# Import model
import langid

# Declare the texts you want to classify
texts = [
    'My name is Stanley',    # English
    'Me llamo Oscar',        # Spanish
    'Mein name ist Dwight',  # German
]

# Classify texts
for i, text in enumerate(texts):
    # Use the `classify()` method
    pred = langid.classify(text)
    # Output
    print(f'Prediction for text {i}: {pred}')

> Prediction for text 0: ('en', -34.635329246520996)
> Prediction for text 1: ('es', -12.523486137390137)
> Prediction for text 2: ('de', -24.241767406463623)
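
The second element of each tuple is a score, not a probability. As far as the langid README describes it, the default score is an unnormalized log-probability, which is why the values are negative and hard to compare across texts. The sketch below shows one way to get probabilities between 0 and 1 instead, and to restrict the candidate languages. The helper names `make_identifier` and `confident_language`, and the 0.9 threshold, are my own assumptions for illustration and not part of langid.

```python
def make_identifier(languages=None):
    """Build a langid identifier that reports normalized probabilities.

    `languages` is an optional list of ISO 639-1 codes (e.g. ['en', 'es'])
    that limits which languages the model may return.
    """
    # Imported inside the function so the helpers below can be used
    # (and tested) even on a machine where langid is not installed.
    from langid.langid import LanguageIdentifier, model

    # norm_probs=True makes classify() return a probability in [0, 1]
    # instead of the default unnormalized log-probability.
    identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
    if languages is not None:
        identifier.set_languages(languages)
    return identifier


def confident_language(prediction, threshold=0.9):
    """Return the language code if its probability reaches the threshold.

    `prediction` is a (language, probability) tuple as returned by an
    identifier built with make_identifier(). The 0.9 default is an
    arbitrary illustrative value, not a recommendation from langid.
    """
    language, probability = prediction
    return language if probability >= threshold else None
```

For example, `confident_language(make_identifier(['en', 'es', 'de']).classify('Me llamo Oscar'))` returns a language code only when the model is reasonably sure, and `None` otherwise. Very short texts tend to get lower probabilities, so the threshold is worth tuning on your own data.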

Closing remarks

Language identification is easy if you know where to look. Current open-source solutions have really good accuracy and low variance, so consider using them next time you have to filter out content written in other languages.
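
To make the filtering idea concrete, here is a minimal sketch that keeps only the texts predicted to be in one target language. The function name `keep_language` and its `classify` parameter are hypothetical; by default it falls back to `langid.classify`, but any function that returns a `(language, score)` tuple will work.

```python
def keep_language(texts, target='en', classify=None):
    """Return the texts whose predicted language equals `target`.

    `classify` is any callable mapping a string to a (language, score)
    tuple. When omitted, langid.classify is used.
    """
    if classify is None:
        # Imported lazily so a custom classifier can be used without langid.
        import langid
        classify = langid.classify

    # Only the predicted language code is used here; the score is ignored.
    return [text for text in texts if classify(text)[0] == target]
```

Calling `keep_language(texts, target='en')` on the list from Step 2 should leave only the texts that langid labels as English. Passing the classifier in as a parameter also makes it easy to swap in another model later.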
