Overview of Open Typo Correction Solutions

Each user has ever had typos when writing search queries. The absence of mechanisms that correct typos leads to the issuance of irrelevant results, or even to their absence. Therefore, to make the search engine more user-oriented, error correction mechanisms are built into it.

The task of correcting typos, at first glance, seems pretty straightforward. But if you start from a variety of errors, implementing a solution can be difficult. In general, typo correction is divided into context-independent and context-sensitive (where the dictionary environment is taken into account) . In the first case, errors are corrected for each word separately, in the second - taking into account the context (for example, for the phrase "she went home" in the context-independent case, correction occurs for each word separately, where we can get "she went home", and in the second case, the correct correction will give "she went home").

In the search queries of a Russian-speaking user, four main groups of errors can be distinguished only for context-independent correction [ 1 ]:
1) errors in the words themselves (say mr → hi), this category includes all kinds of omissions, insertions and permutations of letters - 63.7%,
2) continuous spelling of words - 16.9%,
3) distorted layout (ghbdtn → hi) - 9.7%,
4) transliteration (privet → hi) - 1.3%,
5) mixed errors - 8.3%.

Users make typos in approximately 10-15% of cases. At the same time, 83.6% of requests have one error, 11.7% - two, 4.8% - more than three. Context is important in 26% of cases.

These statistics were compiled on the basis of a random sample from the Yandex daily log back in 2013 based on 10,000 queries. In the public domain there is a much earlier presentation from Yandex for 2008, which shows a similar distribution of statistics [ 2 ]. From this we can conclude that the distribution of varieties of errors for search queries, on average, does not change over time.

In general, the typo correction mechanism is based on two models: the error model and the language model. Moreover, for a context-independent correction, only the error model is used, and in a context-dependent correction, two are used at once. The error model is usually either the editorial distance (Levenshtein, Damerau-Levenshtein distance, various weighting factors, methods similar to Soundex, etc. can also be added here - in this case, the distance is called weighted) , or the Bril-Moore model, which works on the probabilities of transitions from one line to another. Bril and Moore position their model as more advanced, however, at one of the latest SpellRuEval competitions, the Damerau-Levenshtein approach showed a better result [ 3 ], despite the fact that the Damerau-Levenshtein distance (clarification is unweighted) does not use a priori information about the off-guard statistics . This observation is especially indicative if the same training texts were used for different implementations of auto-correctors in the DeepPavlov library.

Obviously, the possibility of a context-sensitive correction complicates the construction of the auto-corrector, since in addition to the error model, the need for a language model is added. But if you pay attention to the statistics of typos, then ¾ all incorrectly written search queries can be corrected without context. This suggests that the benefit of at least a context-independent auto-corrector can be very significant.

Also, a context-sensitive correction for correcting typos in requests is very resource-intensive. For example, in one of Yandex's speeches, the list of pairs for correcting typos (bigrams) of words differed 10 times in comparison with the number of words (unigrams) , what then about trigrams? Obviously, this substantially depends on the variability of the queries. It looks a little strange when the auto-corrector takes up half the memory of the proposed company product, the purpose of which is not focused on solving the spelling problem. So the issue of implementing context-sensitive corrections in the search engines of software products can be very controversial.

At first glance, one gets the impression that there are many ready-made solutions for any programming language that can be used without much immersion in the details of the operation of algorithms, including in commercial systems. But in practice, the development of their solutions continues. For example, relatively recently, Joom made its own decision to correct typos using language models for search queries [ 4 ]. Is the situation really difficult with the availability of turnkey solutions? To this end, as far as possible, a broad overview of existing solutions has been made. Before starting the review, we will decide on how the quality of the auto-corrector is checked.

Quality control

The issue of checking the quality of the auto-corrector is very controversial. One simple verification approach is through Precision and Recall. In accordance with the ISO standard, accuracy and completeness are complemented by correctness (in English “corectness”).

Completeness (Recall) is calculated as follows: a list of correct words is supplied to the auto-corrector (Total_list_true), and the number of words that the auto-corrector considers correct (Spellchecker_true), divided by the total number of correct words (Total_list_true), will be considered as completeness.

R e c a l l = S p e l l c h e c k e r_{t} r u e o v e r T o t a l_{l} i s t_{t} r u e

$Recall = {Spellchecker \ _true \ over Total \ _list \ _true}$

To determine the accuracy (Precision), a list of incorrect words (Total_list_false) is supplied to the input of the auto-corrector, and the number of words that the auto-corrector considers incorrect (Spell_checker_false), divided by the total number of incorrect words (Total_list_false), is defined as accuracy.

P r e c i s i o n = S p e l l c h e c k e r_{f} a l s e o v e r T o t a l_{l} i s t_{f} a l s e

$Precision = {Spellchecker \ _false \ over Total \ _list \ _false}$

How generally these metrics are informative and how they can be useful, each determines independently. After all, in fact, the essence of this test is that the word is checked for in the training dictionary. Correctness can be considered a more obvious metric, according to which the auto- corrector for each word from the test set of incorrect words forms a list of replacement candidates for which this incorrect word can be corrected (keep in mind that there may be words that are not contained in the training dictionary) . Suppose the size of such a list of replacement candidates is 5. Based on the fact that the size of the list is 5, 6 groups will be formed, in one of which we will put our original wrong word according to the following principle: in the 1st group - if in the list of replacement candidates, the correct word that we assume is the 1st, in the 2nd if it is the 2nd, etc., and in the last group, if the supposed correct word is not in the list of replacement candidates. Of course, the more words fell into the 1st group and the less into the 6th, the better the auto-corrector works.

The authors considered the approach considered in the article [ 5 ], in which context-independent auto-correctors with a bias to the ISO standard were compared. There are also links to other ways of assessing quality.

On the one hand, this approach is not based on gross statistics, which can be based on the Brill – Moore error model [ 6 ] or the Damerau – Levenshtein weighted distance error model.

To check the quality of the work of the context-independent auto-corrector, we created our own typo generator, which generated typos of the wrong layout and spelling typos based on the statistics on typos submitted by Yandex. For spelling errors, random insertions, replacements, deletions, permutations were generated, and the number of errors also varied in accordance with these statistics. For errors in the distorted layout, the correct word was changed character-by-character entirely in accordance with the character translation table.

Next, a series of experiments was conducted for the entire list of words in the training dictionary (the words of the training dictionary were corrected for incorrect ones in accordance with the probability of a particular typo) . On average, the auto-corrector corrects words correctly in 75% of cases. Without a doubt, this number will be reduced when the learning dictionary is replenished with words close in editorial distance, a large variety of word forms. This problem can be solved by supplementing with language models, but it should be borne in mind that the amount of required resources will increase significantly.

Ready-made solutions

Consideration of ready-made solutions was carried out with a focus on their own use, and priority was given to auto-correctors that satisfy three criteria:
1) implementation language,
2) type of license,
3) updatability.

In product development, the Java language is considered one of the most popular, so priority was given to it when searching for libraries. Of the licenses relevant: MIT, Public, Apache, BSD. Updatability - no more than 2 years from the last update. During the search, additional information was recorded, for example, about the supported platform, the required additional programs, application features, possible difficulties during the first use, etc. Links with basic and useful resources to sources are given at the end of the article. In general, if not limited to the above criteria, the number of existing solutions is large. Let's take a brief look at the main ones, and pay more attention to only a few.

Historically, one of the oldest auto- correctors is Ispell (International Spell), written in assembler in 1971, later transferred to C and uses the Damerau-Levenshtein editorial distance as a model of errors. For him, there is even a dictionary in Russian. Subsequently, he was replaced by two auto- correctors HunSpell (formerly MySpell ) and Aspell . Both are implemented in C ++ and are distributed under GPL licenses. HunSpell is also covered by the GPL / MPL and is used to correct typos in OpenOffice, LibreOffice, Google Chrome, and other tools.

There are a lot of JS solutions for the Internet and browsers (this includes: nodehun-sentences , nspell , node-markdown-spellcheck , Proofreader , Spellcheck-API - a group of solutions based on the Hunspell auto- corrector ; grunt-spell - under NodeJS; yaspeller- ci - wrapper for Yandex.Speller auto- corrector , distributed under MIT; rousseau - Lightweight proofreader in JS - used for spelling).

The category of paid solutions includes: Spellex ; Source Code Spell Checker - as a desktop application; for JS: nanospell ; for Java: Keyoti RapidSpell Spellchecker , JSpell SDK , WinterTree (WinterTree can even buy source code for $ 5000).

The auto-corrector of Peter Norvig is widely used. Its Python code is publicly available in the article “ How to Write a Spelling Corrector ” [ 7 ]. Based on this simple solution, auto -correctors were built in other languages, for example: Norvig-spell-check , scala-norvig-spell-check (on Scala), toy-spelling-corrector - Golang Spellcheck (on GO), pyspellchecker (on Python) . Of course, here we are not talking about language models and context-sensitive correction.

For text editors, in particular for VIM, vim-dialect , vim-ditto are made - distributed under a public license; for Notepad ++ developed by DspellCheck in C ++, GPL license; Emacs has a tool for automatically detecting languages when printing, called guess-language , distributed under a public license.

There are separate services from the search giants: Yandex.Speller - from Yandex, we mentioned above the wrapper for it, google-api-spelling-java (respectively, from Google).

Free libraries for Java: languagetool (licensed under LGPL), integrates with Lucene text search library and allows the use of language models, Java version 8 is required to work; Jazzy (an analogue of Aspell ) is licensed under LGPLv2 and has not been updated since 2005, and in 2013 was ported to GitHub. In the likeness of this auto-corrector, a separate solution has been made [ 8 ]; Jortho (Java Orthography) is distributed under the GPL and allows free use exclusively for non-commercial purposes, for commercial ones - for an additional fee; Jaspell (licensed under BSD and has not been updated since 2005); Open Source Java Suggester - has not been updated since 2013, distributed by SoftCorporation LLC and allows commercial use; LuceneSpellChecker is a Lucene autocorrector written in Java and distributed under the Apache license.

For a long time, Wolf Garbe dealt with typos, he proposed the algorithms SymSpell (under the MIT license) and LinSpell (under the LGPL) with C # implementations [ 9 ] that use the Damerau-Levenshtein distance for the error model. The peculiarity of their implementation is that at the stage of generating possible errors for the input word, only deletions are used, instead of all kinds of deletions, inserts, replacements and permutations. In comparison with the implementation of Peter Norwig's auto-corrector, both algorithms work faster due to this, while the increase in speed increases significantly if the Damerau-Levenshtein distance becomes more than two. Also, due to the fact that only deletions are used, the dictionary formation time is reduced. The difference between the two algorithms is that LinSpell is more economical in memory and slower in search speed, SymSpell - vice versa. In a later version, SymSpell fixes split-spelling errors. Language models are not used.

The most recent and promising for use auto-correctors working with language models and correcting context-sensitive typos include Yandex.Speller , JamSpell [ 10 ], DeepPavlov [ 11 ]. The last 2 are distributed freely: JamSpell (MIT), DeepPavlov (under Apache).

Yandex.Speller uses the CatBoost algorithm, works with several languages and corrects all kinds of errors, even taking into account the context. The only solution found that corrects incorrect layout errors and transliteration. The solution has versatility, which makes it popular. Its disadvantage is that it is a remote service, and about restrictions and conditions of use can be found here [ 12 ]. The service works with a limited number of languages, you cannot add words yourself and manage the correction process. In accordance with the resource [ 3 ] according to the results of the RuSpellEval competition, this auto-corrector showed the highest quality of corrections. JamSpell is the fastest known auto- corrector (C ++ implementation), there are ready-made binders for other languages. Corrects errors only in the words themselves and works with a specific language. It is impossible to use the solution at the level of unigrams and bigrams. To obtain acceptable quality, a large training text is required.
DeepPavlov has some good practices, however, the integration of these solutions and the subsequent support in their own product can cause difficulties, because working with them requires connecting a virtual environment and using an earlier version of Python 3.6. DeepPavlov provides a choice of three ready-made implementations of auto- correctors , in two of which Bril-Moore error models are applied and in two language models. It only corrects spelling errors, and the variant with the error model based on the Damerau-Levenshtein distance can correct errors of spelling.

I’ll mention another one of the modern approaches to correcting typos, which is based on the use of vector representations of words (Word Embeddings). Its advantage is that on it you can build an auto-corrector to correct words based on context. More details about this approach can be found here [ 13 ]. But in order to use it to correct typos of search queries, you need to accumulate a large log of queries. In addition, the model itself can be quite capacious in terms of memory consumption, which will affect the complexity of integration into the product.

Naumen Choice

Of the ready-made solutions for Java, an auto-corrector from Lucene was chosen (distributed under a license from Apache). Allows you to correct typos in words. The learning process is quick: for example, the formation of a special data structure of the dictionary - the index for 3 million lines was 30 seconds on the Intel Core i5-8500 3.00GHz processor, 32 Gb RAM, Lucene 8.0.0. In earlier versions, the time may be 2 times longer. The size of the training dictionary is 3 million lines (~ 73 Mb txt-file), the index structure is ~ 235 Mb. For the error model, you can choose the distance of Jaro-Winkler, Levenshtein, Damerau-Levenshtein, N-Gram, if necessary, you can add your own. If necessary, it is possible to connect a language model [ 14 ]. Models have been known since 2001, but their comparison with well-known modern solutions in the public domain was not found. The next step is to check their work.

The resulting Lucene-based solution only corrects errors in the words themselves. It is not difficult to add correction of a distorted keyboard layout to any such solution by means of an appropriate translation table, thereby reducing the possibility of irrelevant output to 10% (in accordance with typo statistics). In addition, it is easy to add separate writing of fused 2 words and transliteration.

The main disadvantages of the solution are the need for knowledge of Java, the lack of detailed use cases and detailed documentation, which is reflected in a decrease in the speed of development of solutions for Data-Science specialists. In addition, typos with a Damerau-Levenshtein distance of more than 2 are not corrected. Again, if we proceed from the misrepresentation statistics, more than 2 errors in a word occur less frequently than in 5% of cases. Is the need to complicate the algorithm, in particular, an increase in memory consumption, justified? It already depends on the case of the customer. If there are additional resources, then why not use them?

Key resources for affordable auto-correctors :

References

Source: https://habr.com/ru/post/463683/

All Articles

Overview of Open Typo Correction Solutions

Quality control

Ready-made solutions

Naumen Choice

More articles: