Why and what for
Machine learning algorithms of all kinds have been gaining popularity for a long time. Thanks to the large companies that drive technological progress, many open-source products have also appeared. One of them is fastText, which is discussed below.
fastText is a library developed by Facebook. Its main purpose is text classification. Text classification may be needed for:
- grouping texts by "similarity" (news on one topic)
- grouping texts with similar subjects into one group (news about cars)
- searching for information that may be spam
- searching for clickbait
- ...
In fact, there are a lot of options, and listing them all makes no sense; the idea should be clear.
First training
The library page has step-by-step installation and first training instructions. I will not dwell on them.
Goodies: they also have ready-made models in different languages for classification here.
Library setup
The trouble with training is that the parameters are individual. No set of parameters guarantees excellent results. You can find a ton (or not quite a ton) of articles on the Internet with example parameters, and they may not suit you, because they will give an unsatisfactory result on your data.
Only empirically can you choose the parameters that suit you. Below is a list of those that significantly affect the result:
- dim - the dimension, which controls the size of the vectors (yes, a tautology): the larger they are, the more information they can capture, but this requires more data. And if there is a lot of data, training will be slower. The default is 100 dimensions. Start with 150 and pick the value that works best for you.
- lr - the learning rate. If the parameter is very small, the model becomes more sensitive to the text and may fail to recognize similar texts; if it is very large, the model can, on the contrary, "say" that texts are similar when in reality they are not. Start with 0.1 (the default is 0.05).
- epoch - the number of epochs, that is, the number of passes over your data. More is better (but, alas, not always), and it increases the training time. Start at 150 (the default is 5).
- learning model. Read the description from Facebook; it is quite clear.
- loss - the loss function, that is, how the comparison is made. Everything here is very individual and depends on the data.
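Since the parameters can only be chosen empirically, one way to organize the search is a small grid over the values above. The following is a minimal sketch, not a recommended recipe: the `SEARCH_SPACE` values, `param_grid`, and `pick_best` are all hypothetical names and numbers chosen for illustration, and the scoring function is left to the caller.

```python
from itertools import product


def param_grid(space):
    """Yield every combination of the values in `space` as a dict."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def pick_best(space, evaluate):
    """Return (score, params) for the highest-scoring combination.

    `evaluate` is supplied by the caller: it takes a params dict and
    returns a number where higher is better (for example, precision on
    a held-out file after training a model with those params).
    """
    best = None
    for params in param_grid(space):
        score = evaluate(params)
        if best is None or score > best[0]:
            best = (score, params)
    return best


# Hypothetical search space around the starting points from the list above.
SEARCH_SPACE = {
    "dim": [100, 150, 200],
    "lr": [0.05, 0.1, 0.5],
    "epoch": [25, 150],
    "loss": ["softmax", "hs"],
}
```

If you train a supervised classifier with the Python bindings, `evaluate` could call `fasttext.train_supervised(input=train_path, **params)` and return the precision reported by `model.test(valid_path)`; the two paths are placeholders for your own files. Keep in mind that every combination is a full training run, so the grid should stay small.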
A small digression: it is very cool that even without deep knowledge of text classification or of the internals of neural networks, you can get a working model.
Text preparation
The input text matters too. The cleaner the text, the better the results you get from the model. Basic rules for preparing text for training:
- delete all tags
- convert to lowercase
- remove punctuation characters
- remove hashtags and links
- exclude stop words
- exclude short words (1, 2, or 3 characters; everyone decides this for their own data)
Some write that you can simply feed raw text into the model and train. This option did not suit me. I am inclined to believe that without preprocessing you get a poor-quality model.
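The preparation rules above can be sketched in a few lines of Python using only the standard library. This is an illustration under assumptions, not the exact pipeline used for the model in this article: the `preprocess` function, the regular expressions, the tiny `STOP_WORDS` set, and the `min_len=3` threshold are all made up for the example.

```python
import re

# Tiny illustrative stop-word list; a real one depends on language and data.
STOP_WORDS = {"the", "and", "for", "with", "that", "this", "are", "was"}

TAG_RE = re.compile(r"<[^>]+>")                  # HTML/XML tags
LINK_RE = re.compile(r"(?:https?://|www\.)\S+")  # links
HASHTAG_RE = re.compile(r"#\w+")                 # hashtags
PUNCT_RE = re.compile(r"[^\w\s]|_")              # punctuation and underscores


def preprocess(text, min_len=3, stop_words=STOP_WORDS):
    """Clean one document and return it as a single space-separated line."""
    text = TAG_RE.sub(" ", text)
    text = text.lower()
    # Links and hashtags go before punctuation; otherwise "#" and "://"
    # would already be stripped and the patterns would no longer match.
    text = LINK_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    # Drop stop words and words shorter than min_len characters.
    words = [w for w in text.split()
             if len(w) >= min_len and w not in stop_words]
    return " ".join(words)
```

The order of the steps differs slightly from the list: hashtags and links are removed before punctuation so that they can still be recognized. Raising `min_len` to 4 also drops three-character words, which is the "everyone decides for their own data" knob.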
Preparation of text for classification
The same rules apply here, but experience has shown that they can be supplemented with lemmatization or stemming. With them, the results can be significantly improved (or worsened). In addition, once you have formed clusters, do not forget that clustering algorithms must also be applied to those clusters, but very carefully, because you can collapse similar topics into one cluster. This is very evident in sports: the model understands that the news is about football, but it is very difficult to get the model to distinguish the Spanish championship from the Italian one.
Programming language
More than true: as they said in Family Guy, “Yes, nobody cares.”
To train the model, you can choose either PHP (I took it, since most sites are written in it) or Python (there is a library for it). But there is a very funny catch: if training time matters to you, you will still have to train the model by running fasttext from the command line. So it does not matter what language you write the training code in; use whichever is convenient.
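Because training ends up as a command-line call anyway, the code around it can be very thin. Here is a minimal Python sketch, assuming the `fasttext` binary is on your PATH and that you train a supervised model; `build_train_command`, `train`, and the file names in the usage note are hypothetical, while the flags (`-input`, `-output`, `-dim`, `-lr`, `-epoch`, `-loss`) are the ones the command-line tool accepts.

```python
import subprocess


def build_train_command(train_file, model_prefix, mode="supervised",
                        dim=150, lr=0.1, epoch=150, loss="softmax",
                        binary="fasttext"):
    """Build the argument list for one fasttext training run.

    `mode` is the fasttext subcommand; "supervised" is an assumption
    here, and "skipgram" or "cbow" would train word vectors instead.
    """
    return [
        binary, mode,
        "-input", train_file,
        "-output", model_prefix,
        "-dim", str(dim),
        "-lr", str(lr),
        "-epoch", str(epoch),
        "-loss", loss,
    ]


def train(train_file, model_prefix, **params):
    """Run the training and return the path of the resulting .bin file."""
    command = build_train_command(train_file, model_prefix, **params)
    # check=True raises CalledProcessError if fasttext exits with an error.
    subprocess.run(command, check=True)
    return model_prefix + ".bin"
```

For example, `train("news.train.txt", "news_model", epoch=50)` would run the tool and, if it succeeds, return `news_model.bin`. The same argument list can be assembled in PHP and passed to its process functions, which is why the choice of language for this step hardly matters.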
As for the clustering mechanism, it is a bit more complicated (or simpler). If you like reinventing the wheel (you control all the processes yourself and need a flexible control mechanism), write in PHP (if the site is in PHP). If you do not want to write libraries and you have a choice of language, then it is probably better to take Python. I did not notice a significant difference in speed (in how fast the code runs, not in how fast it is written). It's up to you.
Instead of a conclusion
I have a model that is built solely on news content from the past few days. Its vocabulary is about 40,000 words. You can play with it. But keep in mind that:
- This is not a universal model. It is trained only on news content.
- The model does not contain all the news from the database, only the editorial part (this is enough to solve the task). This means that the model can give a low percentage on similar news.