Neural Machine Translation (NMT) is developing rapidly. Today you no longer need two university degrees to put together your own translator. But to train a model you need a large parallel corpus (a corpus in which each sentence in the source language is paired with its translation). In practice, this means at least a million sentence pairs. There is even a whole subfield of machine translation that studies methods for training models for language pairs with little data available in electronic form (Low-Resource NMT).
We are building a Chuvash-Russian corpus, and along the way we are experimenting with what can be done with the amount of data already available. In this example, a corpus of 90,000 sentence pairs was used. The best result so far has come from transfer learning, and that is what this article is about. The goal of the article is to give a practical example that can easily be reproduced.
The training plan is as follows: take a large (parent) corpus, train a neural model on it, and then train our child model on top of it. Moreover, the target language of translation stays the same: Russian. Intuitively, this is like learning a second foreign language: it is easier when you already know one. It is also like studying a narrow domain of a foreign language, say English medical terminology: first you need to learn English in general.
As the parent corpus, we tried 1 million sentence pairs from an English-Russian parallel corpus and 1 million from a Kazakh-Russian corpus. The Kazakh data contains 5 million sentence pairs; of these, only pairs with a match score (the third column) greater than 2 were kept. The Kazakh variant gave slightly better results. Intuitively this seems plausible, since Chuvash and Kazakh are closer to each other, but in fact it has not been proven and also depends heavily on corpus quality. More details about choosing the parent corpus can be found in this article.
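As a rough illustration of that filtering step, here is a minimal Python sketch. It assumes the Kazakh-Russian data is a tab-separated file with the Kazakh sentence, the Russian sentence, and the match score in the first three columns; the file names here are hypothetical.

# Minimal sketch: keep only pairs whose match score (3rd column) is above 2.
# Assumes a tab-separated file "kk-ru.tsv" with columns: kk, ru, score.
with open("kk-ru.tsv", encoding="utf-8") as src, \
     open("kk.parent.train", "w", encoding="utf-8") as kk_out, \
     open("ru.parent.train", "w", encoding="utf-8") as ru_out:
    for line in src:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue
        kk, ru, score = parts[0], parts[1], parts[2]
        if float(score) > 2:
            kk_out.write(kk + "\n")
            ru_out.write(ru + "\n")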
You can learn more about the child corpus of 90,000 sentence pairs and request sample data here. Now to the code. If you do not have a fast graphics card of your own, you can train the model on Colab. For training we used the Sockeye library. It is assumed that Python 3 is already installed.
pip install sockeye
You may also need to tinker separately with MXNet, which is responsible for working with the video card. On Colab an additional library has to be installed:
pip install mxnet-cu100mkl
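To make sure the GPU is actually visible to MXNet (for example, on a Colab GPU runtime), a quick sanity check like the following can help; this is just a sketch, assuming a reasonably recent MXNet build, and is not part of the original pipeline.

import mxnet as mx

# Should print at least 1 on a machine/runtime with a working GPU setup.
print(mx.context.num_gpus())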
It is commonly believed that with neural networks it is enough to feed them the data as is and they will figure everything out themselves. In reality this is not always the case. In our case the corpus needs to be preprocessed. First we tokenize it, so that it is easier for the model to understand that “cat!” and “cat” are roughly the same thing. A plain Python tokenizer will do, for example.
from nltk.tokenize import WordPunctTokenizer

def tokenize(src_filename, new_filename):
    tokenizer = WordPunctTokenizer()
    with open(src_filename, encoding="utf-8") as src_file, \
         open(new_filename, "w", encoding="utf-8") as new_file:
        for line in src_file:
            new_file.write(' '.join(tokenizer.tokenize(line)))
            new_file.write("\n")
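For example, it can be applied to the training files like this. The untokenized source file names here are hypothetical; the .tok names match those used later in the article.

# Hypothetical source file names; the *.tok names match those used below.
for src, dst in [
    ("kk.parent.train", "kk.parent.train.tok"),
    ("ru.parent.train", "ru.parent.train.tok"),
    ("chv.child.train", "chv.child.train.tok"),
    ("ru.child.train", "ru.child.train.tok"),
]:
    tokenize(src, dst)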
As a result, we feed pairs of sentences of the form
ӗ ҫ ӳ ӑӑ. ӑ ӑ ӑ ӑӗ, ӑ ӑӑӗ, ӑ ӗӗ -ӑ ӗӗҫ, ҫӗ ӗ ӗҫ ӑӑ ӑӑ, ҫ ӗ ӗ ӑ ӑ ӑӑ ӑ .
and
. , , , , , .
The output is the following tokenized sentences:
ӗ ҫ ӳ ӑӑ . ӑ ӑ ӑ ӑӗ , ӑ ӑӑӗ , ӑ ӗӗ - ӑ ӗӗҫ , ҫӗ ӗ ӗҫ ӑӑ ӑӑ , ҫ ӗ ӗ ӑ ӑ ӑӑ ӑ .
and in Russian
. , , , , , .
In our case we will need the combined vocabularies of the parent and child corpora, so we create shared files:
cp kk.parent.train.tok kkchv.all.train.tok
cat chv.child.train.tok >> kkchv.all.train.tok
cp ru.parent.train.tok ru.all.train.tok
cat ru.child.train.tok >> ru.all.train.tok
since the further training of the child model will use the same vocabulary.
Now a small but important digression. In MT, sentences are usually split into atomic units, words, and sentences are then treated as sequences of words. But this is usually not enough, because a huge tail of words that occur only once in the corpus is formed, and it is hard to build a probabilistic model for them. This is especially true for languages with rich morphology (case, gender, number), and both Russian and Chuvash are exactly such languages. But there is a solution: split sentences one level lower, into subwords. We used Byte Pair Encoding (BPE).
git clone https://github.com/rsennrich/subword-nmt.git
We get roughly the following sequences of subwords
@@ ӗ ҫ ӳ@@ ӑӑ . @@ ӑ ӑ ӑ @@ ӑӗ , ӑ ӑӑ@@ ӗ , ӑ@@ @@ ӗӗ - ӑ@@ ӗ@@ ӗҫ , ҫӗ@@ ӗ ӗҫ@@ @@ ӑӑ ӑӑ , ҫ@@ @@ @@ ӗ ӗ @@ @@ @@ ӑ ӑ ӑӑ ӑ .
and
@@ @@ @@ . @@ @@ @@ @@ @@ , @@ , @@ @@ @@ @@ @@ @@ @@ @@ @@ @@ , , @@ @@ @@ @@ @@ @@ @@ , @@ @@ @@ @@ @@ .
It can be seen that affixes are cleanly split off from the word stems as separate subwords.
To do this, we prepare the BPE vocabularies:
python subword-nmt/subword_nmt/learn_joint_bpe_and_vocab.py --input kkchv.all.train.tok ru.all.train.tok -s 10000 -o bpe.codes --write-vocabulary bpe.vocab.kkchv bpe.vocab.ru
And apply them to the tokenized files, for example:
python subword-nmt/subword_nmt/apply_bpe.py -c bpe.codes --vocabulary bpe.vocab.kkchv --vocabulary-threshold 50 < kkchv.all.train.tok > kkchv.all.train.bpe
python subword-nmt/subword_nmt/apply_bpe.py -c bpe.codes --vocabulary bpe.vocab.ru --vocabulary-threshold 50 < ru.all.train.tok > ru.all.train.bpe
The same needs to be done for all files: the training, validation, and test sets of both the parent and the child model.
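As a rough sketch of that step, one way is to loop over the splits in Python. This assumes the .tok files for every split already exist and follow the naming used throughout this article.

import subprocess

# Hypothetical loop over all corpus files; names follow the article's naming scheme.
files = [
    ("kk.parent", "bpe.vocab.kkchv"),
    ("chv.child", "bpe.vocab.kkchv"),
    ("ru.parent", "bpe.vocab.ru"),
    ("ru.child", "bpe.vocab.ru"),
]
for prefix, vocab in files:
    for split in ("train", "dev", "test"):
        tok_name = f"{prefix}.{split}.tok"
        bpe_name = f"{prefix}.{split}.bpe"
        with open(tok_name, encoding="utf-8") as tok_file, \
             open(bpe_name, "w", encoding="utf-8") as bpe_file:
            subprocess.run(
                ["python", "subword-nmt/subword_nmt/apply_bpe.py",
                 "-c", "bpe.codes",
                 "--vocabulary", vocab,
                 "--vocabulary-threshold", "50"],
                stdin=tok_file, stdout=bpe_file, check=True)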
Now we turn to training the neural model itself. First, the shared model vocabularies need to be prepared:
python -m sockeye.prepare_data -s kkchv.all.train.bpe -t ru.all.train.bpe -o kkru_all_data
Next, we train the parent model. A simple example is described in more detail on the Sockeye page. Technically, the process consists of two steps: preparing the data using the previously created model vocabularies
python -m sockeye.prepare_data -s kk.parent.train.bpe -t ru.parent.train.bpe -o kkru_parent_data --source-vocab kkru_all_data/vocab.src.0.json --target-vocab kkru_all_data/vocab.trg.0.json
and the training itself:
python -m sockeye.train -d kkru_parent_data -vs kk.parent.dev.bpe -vt ru.parent.dev.bpe --encoder transformer --decoder transformer --transformer-model-size 512 --transformer-feed-forward-num-hidden 256 --transformer-dropout-prepost 0.1 --num-embed 512 --max-seq-len 100 --decode-and-evaluate 500 -o kkru_parent_model --num-layers 6 --disable-device-locking --batch-size 1024 --optimized-metric bleu --max-num-checkpoint-not-improved 10
Training on Colab takes about a day. When the model has finished training, you can translate the test set with it:
python -m sockeye.translate --input kk.parent.test.bpe -m kkru_parent_model --output ru.parent.test_kkru_parent.bpe
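The output of sockeye.translate is still in BPE form. Before reading the translations or scoring them, the @@ markers need to be merged back into words. A minimal post-processing sketch might look like this; the output file name is hypothetical, the input name matches the command above.

import re

# Merge BPE subwords back into words by removing the "@@ " continuation markers.
def remove_bpe(line):
    return re.sub(r"@@( |$)", "", line)

with open("ru.parent.test_kkru_parent.bpe", encoding="utf-8") as bpe_file, \
     open("ru.parent.test_kkru_parent.txt", "w", encoding="utf-8") as out_file:
    for line in bpe_file:
        out_file.write(remove_bpe(line))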
To train the child model, we first prepare its data using the same shared vocabularies:
python -m sockeye.prepare_data -s chv.child.train.bpe -t ru.child.train.bpe -o chvru_child_data --source-vocab kkru_all_data/vocab.src.0.json --target-vocab kkru_all_data/vocab.trg.0.json
The command to start training looks like this:
python -m sockeye.train -d chvru_child_data -vs chv.child.dev.bpe -vt ru.child.dev.bpe --encoder transformer --decoder transformer --transformer-model-size 512 --transformer-feed-forward-num-hidden 256 --transformer-dropout-prepost 0.1 --num-embed 512 --max-seq-len 100 --decode-and-evaluate 500 -o ruchv_150K_skv_dev19_model --num-layers 6 --disable-device-locking --batch-size 1024 --optimized-metric bleu --max-num-checkpoint-not-improved 10 --config kkru_parent_model/args.yaml --params kkru_parent_model/params.best
The added --config and --params parameters indicate that the configuration and weights of the parent model should be used as the starting point. Details are in the Sockeye example on continuing training. Training of the child model converges in about 12 hours.
To sum up, let's compare the results. A conventionally trained machine translation model reached 24.96 BLEU, while the transfer learning model reached 32.38 BLEU. The difference is also clearly visible in sample translations. So, while we continue to collect the corpus, we will be using this model.