
Meta’s Multilingual Translation Machine Struggles with Armenian, Greek, Oromo, and Others: Here’s Why

TECH TIMES
July 7, 2022

Meta Platforms, the owner of Facebook, Instagram, and WhatsApp, released a 190-page paper describing its latest efforts in machine translation. Although the system is capable of translating 202 languages, there are a few that it really struggles with.

According to the story by ZDNet, the paper describes how the state-of-the-art translation system learned languages that are considered "low resource," that is, languages with relatively little training data available.

Among the languages considered "low resource" were the following:

  • West Central Oromo – spoken in the Oromia state of Ethiopia
  • Tamasheq – spoken in several parts of North Africa, including Algeria
  • Waray – spoken by the Waray people of the Philippines

The report from Meta's researchers was uploaded to Facebook's AI research website, along with a blog post providing more thorough information on what Meta is doing.

As written in its mission statement, the "broadly accessible machine translation system" currently supports around 130 languages, with the aim of increasing that number all the way up to 200.

As reported by ZDNet, Meta is open-sourcing its data sets and neural network model code on GitHub. The company is also offering up to $200,000 in awards to outside parties who put the technology to use.

The company has also partnered with the Wikimedia Foundation, the owner of Wikipedia, to provide better translations of Wikipedia articles. ZDNet notes that Meta uses automated methods to compile a data set of bilingual sentence pairs for all of its target languages.

The sets come with some interesting statistics: there are 1,220 language pairs in total, or about 2,440 directions, for training. Those 2,440 directions amount to over 18 billion total sentence pairs, and the majority of the pairs have fewer than a million sentences, making them low-resource directions.
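The bookkeeping above can be sketched in a few lines: each unordered language pair yields two translation directions, and a direction counts as low-resource when it has fewer than a million sentence pairs. The per-direction counts below are purely illustrative numbers, not figures from the paper.

```python
# Each unordered language pair trains in both directions (A->B and B->A).
total_pairs = 1220
directions = total_pairs * 2
print(directions)  # 2440

# Toy sentence-pair counts per direction (hypothetical, for illustration only).
sentence_counts = {
    "eng-gaz": 120_000,       # West Central Oromo
    "eng-fra": 400_000_000,   # a high-resource direction
    "eng-war": 90_000,        # Waray
}

# A direction is "low-resource" if it has fewer than one million sentence pairs.
low_resource = [d for d, n in sentence_counts.items() if n < 1_000_000]
print(low_resource)  # ['eng-gaz', 'eng-war']
```

This mirrors why "1,220 pairs" and "2,440 directions" describe the same data: direction counts simply double the pair counts.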

The authors reportedly use the data to train the NLLB neural net, alongside a hand-crafted data set of translations built by human translators.


The human element, called the NLLB-Seed data set, turns out to be quite important. As the paper puts it, "despite the considerably large size of publicly available training data, training on NLLB-Seed leads to markedly higher performance on average."

Meta is not the only one chewing on these gigantic data sets; Google scientists may soon unveil something similar in the multilingual space. Meanwhile, Meta's research still produced low results for Greek, Armenian, Oromo, and other languages.

Mike Maghakian