Neural Machine Translation (NMT) is developing rapidly. Today you no longer need two university degrees to put together your own translator. But to train a model you need a large parallel corpus (a corpus in which each sentence in the source language is paired with its translation). In practice, this means at least a million sentence pairs. There is even a whole subfield of machine translation that studies methods for training models for language pairs with little data available in electronic form (Low-Resource NMT).
We are building a Chuvash-Russian corpus, and along the way we are experimenting with what can be done with the amount of data already available. In this example, a corpus of 90,000 sentence pairs was used. The best result so far has come from transfer learning, and that is what this article is about. The goal of the article is to give a practical example that can easily be reproduced.
The training plan is as follows: take a large (parent) corpus, train a neural model on it, and then train our child model on top of it. Moreover, the target language of translation stays the same: Russian. Intuitively, this is like learning a second foreign language: it is easier when you already know one. It is also like studying a narrow domain of a foreign language, say English medical terminology: first you need to learn English in general.
As the parent corpus, we tried 1 million sentence pairs from an English-Russian parallel corpus and 1 million from a Kazakh-Russian corpus. The Kazakh data contains 5 million sentence pairs; of these, only pairs with a match score (the third column) greater than 2 were kept. The Kazakh variant gave slightly better results. Intuitively this seems plausible, since Chuvash and Kazakh are closer to each other, but in fact it has not been proven and also depends heavily on corpus quality. More details about choosing the parent corpus can be found in this article.
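As a rough illustration of that filtering step, here is a minimal Python sketch. It assumes the Kazakh-Russian data is a tab-separated file with the Kazakh sentence, the Russian sentence, and the match score in the first three columns; the file names here are hypothetical.

# Minimal sketch: keep only pairs whose match score (3rd column) is above 2.
# Assumes a tab-separated file "kk-ru.tsv" with columns: kk, ru, score.
with open("kk-ru.tsv", encoding="utf-8") as src, \
     open("kk.parent.train", "w", encoding="utf-8") as kk_out, \
     open("ru.parent.train", "w", encoding="utf-8") as ru_out:
    for line in src:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue
        kk, ru, score = parts[0], parts[1], parts[2]
        if float(score) > 2:
            kk_out.write(kk + "\n")
            ru_out.write(ru + "\n")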
You can learn more about the child corpus of 90,000 sentence pairs and request sample data here. Now to the code. If you do not have a fast graphics card of your own, you can train the model on Colab. For training we used the Sockeye library. It is assumed that Python 3 is already installed.
pip install sockeye
You may also need to tinker separately with MXNet, which is responsible for working with the video card. On Colab an additional library has to be installed:
pip install mxnet-cu100mkl
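To make sure the GPU is actually visible to MXNet (for example, on a Colab GPU runtime), a quick sanity check like the following can help; this is just a sketch, assuming a reasonably recent MXNet build, and is not part of the original pipeline.

import mxnet as mx

# Should print at least 1 on a machine/runtime with a working GPU setup.
print(mx.context.num_gpus())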
It is commonly believed that with neural networks it is enough to feed them the data as is and they will figure everything out themselves. In reality this is not always the case. In our case the corpus needs to be preprocessed. First we tokenize it, so that it is easier for the model to understand that “cat!” and “cat” are roughly the same thing. A plain Python tokenizer will do, for example.
from nltk.tokenize import WordPunctTokenizer

def tokenize(src_filename, new_filename):
    tokenizer = WordPunctTokenizer()
    with open(src_filename, encoding="utf-8") as src_file, \
         open(new_filename, "w", encoding="utf-8") as new_file:
        for line in src_file:
            new_file.write(' '.join(tokenizer.tokenize(line)))
            new_file.write("\n")
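For example, it can be applied to the training files like this. The untokenized source file names here are hypothetical; the .tok names match those used later in the article.

# Hypothetical source file names; the *.tok names match those used below.
for src, dst in [
    ("kk.parent.train", "kk.parent.train.tok"),
    ("ru.parent.train", "ru.parent.train.tok"),
    ("chv.child.train", "chv.child.train.tok"),
    ("ru.child.train", "ru.child.train.tok"),
]:
    tokenize(src, dst)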
As a result, we feed pairs of sentences of the form
ӗ ҫ ӳ ӑӑ. ӑ ӑ ӑ ӑӗ, ӑ ӑӑӗ, ӑ ӗӗ -ӑ ӗӗҫ, ҫӗ ӗ ӗҫ ӑӑ ӑӑ, ҫ ӗ ӗ ӑ ӑ ӑӑ ӑ .
and
. , , , , , .
The output is the following tokenized sentences:
ӗ ҫ ӳ ӑӑ . ӑ ӑ ӑ ӑӗ , ӑ ӑӑӗ , ӑ ӗӗ - ӑ ӗӗҫ , ҫӗ ӗ ӗҫ ӑӑ ӑӑ , ҫ ӗ ӗ ӑ ӑ ӑӑ ӑ .
and in Russian
. , , , , , .
In our case we will need the combined vocabularies of the parent and child corpora, so we create shared files:
cp kk.parent.train.tok kkchv.all.train.tok
cat chv.child.train.tok >> kkchv.all.train.tok
cp ru.parent.train.tok ru.all.train.tok
cat ru.child.train.tok >> ru.all.train.tok
since the further training of the child model will use the same vocabulary.
Now a small but important digression. In MT, sentences are usually split into atomic units, words, and sentences are then treated as sequences of words. But this is usually not enough, because a huge tail of words that occur only once in the corpus is formed, and it is hard to build a probabilistic model for them. This is especially true for languages with rich morphology (case, gender, number), and both Russian and Chuvash are exactly such languages. But there is a solution: split sentences one level lower, into subwords. We used Byte Pair Encoding (BPE).
git clone https://github.com/rsennrich/subword-nmt.git
We get roughly the following sequences of subwords
@@ ӗ ҫ ӳ@@ ӑӑ . @@ ӑ ӑ ӑ @@ ӑӗ , ӑ ӑӑ@@ ӗ , ӑ@@ @@ ӗӗ - ӑ@@ ӗ@@ ӗҫ , ҫӗ@@ ӗ ӗҫ@@ @@ ӑӑ ӑӑ , ҫ@@ @@ @@ ӗ ӗ @@ @@ @@ ӑ ӑ ӑӑ ӑ .
and
@@ @@ @@ . @@ @@ @@ @@ @@ , @@ , @@ @@ @@ @@ @@ @@ @@ @@ @@ @@ , , @@ @@ @@ @@ @@ @@ @@ , @@ @@ @@ @@ @@ .
It can be seen that affixes are cleanly split off from the word stems as separate subwords.
To do this, we prepare the BPE vocabularies:
python subword-nmt/subword_nmt/learn_joint_bpe_and_vocab.py --input kkchv.all.train.tok ru.all.train.tok -s 10000 -o bpe.codes --write-vocabulary bpe.vocab.kkchv bpe.vocab.ru
And apply them to the tokenized files, for example:
python subword-nmt/subword_nmt/apply_bpe.py -c bpe.codes --vocabulary bpe.vocab.kkchv --vocabulary-threshold 50 < kkchv.all.train.tok > kkchv.all.train.bpe
python subword-nmt/subword_nmt/apply_bpe.py -c bpe.codes --vocabulary bpe.vocab.ru --vocabulary-threshold 50 < ru.all.train.tok > ru.all.train.bpe
The same needs to be done for all files: the training, validation, and test sets of both the parent and the child model.
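As a rough sketch of that step, one way is to loop over the splits in Python. This assumes the .tok files for every split already exist and follow the naming used throughout this article.

import subprocess

# Hypothetical loop over all corpus files; names follow the article's naming scheme.
files = [
    ("kk.parent", "bpe.vocab.kkchv"),
    ("chv.child", "bpe.vocab.kkchv"),
    ("ru.parent", "bpe.vocab.ru"),
    ("ru.child", "bpe.vocab.ru"),
]
for prefix, vocab in files:
    for split in ("train", "dev", "test"):
        tok_name = f"{prefix}.{split}.tok"
        bpe_name = f"{prefix}.{split}.bpe"
        with open(tok_name, encoding="utf-8") as tok_file, \
             open(bpe_name, "w", encoding="utf-8") as bpe_file:
            subprocess.run(
                ["python", "subword-nmt/subword_nmt/apply_bpe.py",
                 "-c", "bpe.codes",
                 "--vocabulary", vocab,
                 "--vocabulary-threshold", "50"],
                stdin=tok_file, stdout=bpe_file, check=True)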
Now we turn to training the neural model itself. First, the shared model vocabularies need to be prepared:
python -m sockeye.prepare_data -s kkchv.all.train.bpe -t ru.all.train.bpe -o kkru_all_data
Next, we train the parent model. A simple example is described in more detail on the Sockeye page. Technically, the process consists of two steps: preparing the data using the previously created model vocabularies
python -m sockeye.prepare_data -s kk.parent.train.bpe -t ru.parent.train.bpe -o kkru_parent_data --source-vocab kkru_all_data/vocab.src.0.json --target-vocab kkru_all_data/vocab.trg.0.json
and the training itself:
python -m sockeye.train -d kkru_parent_data -vs kk.parent.dev.bpe -vt ru.parent.dev.bpe --encoder transformer --decoder transformer --transformer-model-size 512 --transformer-feed-forward-num-hidden 256 --transformer-dropout-prepost 0.1 --num-embed 512 --max-seq-len 100 --decode-and-evaluate 500 -o kkru_parent_model --num-layers 6 --disable-device-locking --batch-size 1024 --optimized-metric bleu --max-num-checkpoint-not-improved 10
Training on Colab takes about a day. When the model has finished training, you can translate the test set with it:
python -m sockeye.translate --input kk.parent.test.bpe -m kkru_parent_model --output ru.parent.test_kkru_parent.bpe
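The output of sockeye.translate is still in BPE form. Before reading the translations or scoring them, the @@ markers need to be merged back into words. A minimal post-processing sketch might look like this; the output file name is hypothetical, the input name matches the command above.

import re

# Merge BPE subwords back into words by removing the "@@ " continuation markers.
def remove_bpe(line):
    return re.sub(r"@@( |$)", "", line)

with open("ru.parent.test_kkru_parent.bpe", encoding="utf-8") as bpe_file, \
     open("ru.parent.test_kkru_parent.txt", "w", encoding="utf-8") as out_file:
    for line in bpe_file:
        out_file.write(remove_bpe(line))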
To train the child model, we first prepare its data using the same shared vocabularies:
python -m sockeye.prepare_data -s chv.child.train.bpe -t ru.child.train.bpe -o chvru_child_data --source-vocab kkru_all_data/vocab.src.0.json --target-vocab kkru_all_data/vocab.trg.0.json
The command to start training looks like this:
python -m sockeye.train -d chvru_child_data -vs chv.child.dev.bpe -vt ru.child.dev.bpe --encoder transformer --decoder transformer --transformer-model-size 512 --transformer-feed-forward-num-hidden 256 --transformer-dropout-prepost 0.1 --num-embed 512 --max-seq-len 100 --decode-and-evaluate 500 -o ruchv_150K_skv_dev19_model --num-layers 6 --disable-device-locking --batch-size 1024 --optimized-metric bleu --max-num-checkpoint-not-improved 10 --config kkru_parent_model/args.yaml --params kkru_parent_model/params.best
The added --config and --params parameters indicate that the configuration and weights of the parent model should be used as the starting point. Details are in the Sockeye example on continuing training. Training of the child model converges in about 12 hours.
To sum up, let's compare the results. A conventionally trained machine translation model reached 24.96 BLEU, while the transfer learning model reached 32.38 BLEU. The difference is also clearly visible in sample translations. So, while we continue to collect the corpus, we will be using this model.