New search hint service on hh.ru

Search hints are great. How often do we type a site's full address into the address bar, or a product's full name into an online store's search box? For such short queries, typing a few characters is usually enough if the search hints are good. And unless you have twenty fingers or incredible typing speed, you will surely use them.
In this article, we talk about the new hh.ru search hint service, which we built during the previous run of the School of Programmers.

The old service had a number of problems:

it worked only on hand-picked popular user queries;
it could not adapt to changing user preferences;
it could not rank queries that were not in the top list;
it did not correct typos.

In the new service, we fixed these shortcomings (while adding new ones).

Dictionary of popular queries

When there are no hints at all, you can manually select the top-N user queries and generate hints from them by exact word match (with or without regard to word order). This is a good option: it is easy to implement, gives accurate hints, and has no performance problems. Our hint service worked like that for a long time, but a significant drawback of this approach is insufficient recall.

For example, the query “javascript developer” did not make it into such a list, so when the user types “javascript” followed by the first letters of “developer”, we have nothing to show. If we complete the query using only the last word, an unrelated term with the same prefix comes first (“javascript handyman” in our case). For the same reason, we cannot implement error correction that is any more sophisticated than the standard approach of finding the closest words by Damerau-Levenshtein distance.
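A minimal sketch of such dictionary-based hints may make the recall problem concrete. The list `TOP_QUERIES` and the function `dictionary_hints` are hypothetical names with made-up data, not the old service's code; the sketch assumes that completed words must match exactly and that the last word may be a prefix.

```python
# Hypothetical hand-picked list of popular queries, ordered by popularity.
TOP_QUERIES = [
    "java developer",
    "javascript handyman",
    "sales manager",
    "junior java developer",
]


def dictionary_hints(typed, limit=3):
    """Return popular queries that contain every typed word.

    All typed words except the last must occur exactly; the last one
    (still being typed) may be a prefix. Word order is ignored.
    """
    words = typed.lower().split()
    if not words:
        return []
    *complete, last = words
    hints = []
    for query in TOP_QUERIES:
        query_words = query.split()
        if all(w in query_words for w in complete) and any(
            qw.startswith(last) for qw in query_words
        ):
            hints.append(query)
    return hints[:limit]


print(dictionary_hints("java dev"))        # two hints from the list
print(dictionary_hints("javascript dev"))  # nothing to show
```

The second call returns an empty list because no hand-picked query contains both words, which is exactly the gap described above.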

Language model

Another approach is to learn to estimate the probabilities of queries and to generate the most probable continuations of a user query. For this we use language models: a language model is a probability distribution over sequences of words.

[Figure: word_count]

Since user queries are mostly short, we did not even try neural language models and limited ourselves to n-gram models:

$P(w_1 \dots w_m) = \prod_{i=1}^{m} P(w_i \mid w_1 \dots w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)} \dots w_{i-1})$

As the simplest model, we can take the statistical definition of probability, then

$P(w_i \mid w_1 \dots w_{i-1}) = \frac{count(w_1 \dots w_i)}{count(w_1 \dots w_{i-1})}$

However, such a model cannot evaluate queries that were not in our sample: if we never observed the query “junior developer java”, then

$P(\text{java} \mid \text{junior developer}) = \frac{count(\text{junior developer java})}{count(\text{junior developer})} = 0$

and the probability of the whole query is zero as well.
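The count-based estimate can be sketched in a few lines. The query log `QUERIES` and the helper `mle_prob` are illustrative assumptions, not the service's code; the point is only to show how an unseen continuation gets probability zero.

```python
from collections import Counter

# Toy query log; a real one would come from search statistics.
QUERIES = ["junior developer", "junior developer python", "java developer"]

counts = Counter()
for query in QUERIES:
    words = query.split()
    # Count every contiguous word sequence (n-gram) of every length.
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            counts[tuple(words[start:end])] += 1


def mle_prob(word, context):
    """count(context + word) / count(context), or 0.0 for an unseen context."""
    if counts[context] == 0:
        return 0.0
    return counts[context + (word,)] / counts[context]


print(mle_prob("python", ("junior", "developer")))  # seen continuation
print(mle_prob("java", ("junior", "developer")))    # unseen continuation
```

Here "junior developer" occurs twice and is followed by "python" once, so the first estimate is 0.5, while "java" never follows it and gets 0.0.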

To solve this problem, you can use various models of smoothing and interpolation. We used Backoff:

$P_{bo}(w_n \mid w_1 \dots w_{n-1}) = \begin{cases} P(w_n \mid w_1 \dots w_{n-1}), & count(w_1 \dots w_{n-1}) > 0 \\ \alpha P_{bo}(w_n \mid w_2 \dots w_{n-1}), & count(w_1 \dots w_{n-1}) = 0 \end{cases}$

$\alpha = \frac{P(w_1 \dots w_{n-1})}{1 - \sum_{w} P_{bo}(w \mid w_2 \dots w_{n-1})}$

where $P$ is the smoothed probability of $w_1 \dots w_{n-1}$ (we used Laplace smoothing):

$P(w_n \mid w_1 \dots w_{n-1}) = \frac{count(w_1 \dots w_n) + \delta}{count(w_1 \dots w_{n-1}) + \delta |V|}$

where $V$ is our vocabulary and $\delta$ is the smoothing constant.
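The sketch below shows how additive smoothing and backoff fit together on a toy log. It is a simplification under stated assumptions: the backoff weight `ALPHA` is a fixed constant rather than the coefficient defined above, the numerator counts the whole n-gram, and `DELTA`, `QUERIES` and the function names are made up for illustration.

```python
from collections import Counter

QUERIES = ["junior developer", "junior developer python", "java developer"]
DELTA = 0.1  # additive smoothing constant (assumed value)
ALPHA = 0.4  # fixed backoff weight (assumption, not the computed alpha)

counts = Counter()
for query in QUERIES:
    words = query.split()
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            counts[tuple(words[start:end])] += 1

VOCAB = {w for q in QUERIES for w in q.split()}
TOTAL_WORDS = sum(counts[(w,)] for w in VOCAB)


def smoothed_prob(word, context):
    """Additive (Laplace) smoothing of P(word | context)."""
    context_count = counts[context] if context else TOTAL_WORDS
    numerator = counts[context + (word,)] + DELTA
    return numerator / (context_count + DELTA * len(VOCAB))


def backoff_prob(word, context):
    """Use the full context if it was observed, otherwise drop its first word."""
    if not context or counts[context] > 0:
        return smoothed_prob(word, context)
    return ALPHA * backoff_prob(word, context[1:])


# The unseen query from the text no longer gets zero probability.
print(backoff_prob("java", ("junior", "developer")))
# An unseen context backs off to the shorter context ("java",).
print(backoff_prob("developer", ("senior", "java")))
```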

Option Generation

So, we can estimate the probability of a particular query, but how do we generate the queries themselves? A reasonable way is the following: let the user enter a query $w_1 \dots w_n$; then suitable completions can be found from the condition

$w_1 \dots w_m = \underset{w_{n+1} \dots w_m \in V}{\arg\max}\, P(w_1 \dots w_n w_{n+1} \dots w_m)$

Of course, going through all $|V|^{m-n}$ options for $m = 1 \dots M$ to select the best ones for every incoming query is not feasible, so we use Beam Search. For our n-gram language model, it comes down to the following algorithm:

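    # Assumed to be defined elsewhere: STEPS (number of words to append),
    # N (beam width), n (n-gram order), cond_prob(term, context) and
    # prob(phrase), both taken from the language model.

    def beam(initial, vocabulary):
        variants = [initial]
        for _ in range(STEPS):
            candidates = []
            for variant in variants:
                candidates.extend(generate_candidates(variant, vocabulary))
            # keep the N most probable phrases
            variants = sorted(candidates, key=prob, reverse=True)[:N]
        return variants

    def generate_candidates(variant, vocabulary):
        top_terms = []
        # best terms for contexts of length 1, 2, ..., n
        for k in range(1, n + 1):
            context = variant[-k:]
            top_n = sorted(vocabulary, key=lambda t: cond_prob(t, context), reverse=True)[:N]
            top_terms.extend(top_n)
        candidates = [variant + [term] for term in top_terms]
        # keep the N best candidates by the probability of the whole phrase
        return sorted(candidates, key=prob, reverse=True)[:N]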

Here the nodes highlighted in green are the final selected options; the number before a node $w_n$ is the probability $P(w_n \mid w_{n-1})$, and the number after it is $P(w_1 \dots w_n)$.
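For readers who want to run the idea end to end, here is a self-contained sketch of beam search over a toy bigram model. The table `BIGRAMS`, its probabilities, the `<eos>` marker spelling and the function `beam_search` are all assumptions made for the example; log-probabilities are used only to avoid multiplying many small numbers.

```python
import math

# Toy bigram model: P(next | previous); "<eos>" ends a phrase. Values are made up.
BIGRAMS = {
    "java": {"developer": 0.6, "junior": 0.1, "<eos>": 0.3},
    "developer": {"<eos>": 0.7, "junior": 0.1, "java": 0.2},
    "junior": {"java": 0.5, "developer": 0.4, "<eos>": 0.1},
}


def beam_search(prefix, beam_width=2, max_steps=3):
    """Return up to beam_width most probable completions of prefix."""
    beam = [(0.0, list(prefix))]  # (log-probability, words)
    finished = []
    for _ in range(max_steps):
        candidates = []
        for logp, words in beam:
            for term, p in BIGRAMS.get(words[-1], {}).items():
                candidates.append((logp + math.log(p), words + [term]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for cand in candidates[:beam_width]:
            # Completed phrases leave the beam, the rest are expanded further.
            (finished if cand[1][-1] == "<eos>" else beam).append(cand)
        if not beam:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    return [" ".join(words[:-1]) for _, words in finished[:beam_width]]


print(beam_search(["junior"]))
```

With these numbers the two completions are "junior developer" (0.4 · 0.7 = 0.28) and "junior java developer" (0.5 · 0.6 · 0.7 = 0.21). Only `beam_width` candidates survive each step, so the work per step stays bounded instead of growing with $|V|^{m-n}$.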

This is much better, but generate_candidates has to quickly obtain the N best terms for a given context. If we store only the n-gram probabilities, we have to go through the entire dictionary, compute the probabilities of all possible phrases, and then sort them. Obviously, this will not work for online queries.

A trie for probabilities

To quickly obtain the N best continuations of a phrase by conditional probability, we used a trie over terms. The node $w_1 \to w_2$ stores the coefficient $\alpha$, the value $P(w_2 \mid w_1)$, and a list of terms $w_3$ sorted by the conditional probability $P(\bullet \mid w_1 w_2)$, together with $P(w_3 \mid w_1 w_2)$. The special term eos marks the end of a phrase.

[Figure: trie]
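One possible shape of such a term-level trie is sketched below. The class `TermNode`, the builder and the probabilities are hypothetical; the `alpha` field is only stored, not computed, and the end-of-phrase term is spelled `<eos>` here.

```python
from dataclasses import dataclass, field


@dataclass
class TermNode:
    prob: float = 1.0   # P(this term | path leading to the node)
    alpha: float = 1.0  # backoff coefficient for this context (not computed here)
    children: dict = field(default_factory=dict)  # term -> TermNode
    ranked: list = field(default_factory=list)    # [(P(term | context), term)], best first


def build_term_trie(cond_probs):
    """cond_probs maps a context tuple to {next_term: probability}."""
    root = TermNode()
    for context, continuations in cond_probs.items():
        node = root
        for term in context:
            node = node.children.setdefault(term, TermNode())
        for term, p in continuations.items():
            node.children.setdefault(term, TermNode()).prob = p
        node.ranked = sorted(((p, t) for t, p in continuations.items()), reverse=True)
    return root


def top_continuations(root, context, n):
    """N best next terms for a context, read off without scanning the vocabulary."""
    node = root
    for term in context:
        node = node.children.get(term)
        if node is None:
            return []
    return [term for _, term in node.ranked[:n]]


# Made-up conditional probabilities for illustration.
trie = build_term_trie({
    ("java",): {"developer": 0.6, "<eos>": 0.3, "junior": 0.1},
    ("java", "developer"): {"<eos>": 0.8, "junior": 0.2},
})
print(top_continuations(trie, ("java",), 2))
```

Because every node keeps its continuations already sorted, a lookup costs one walk down the context plus a slice of the first N entries.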

But there is a nuance

In the algorithm described above, we assumed that all words in the query are complete. However, this is not true for the last word, which the user is typing right now. To continue the word being typed, we would again have to go through the entire dictionary. To solve this problem, we use a character-level trie whose nodes store the M terms with the highest unigram probability, sorted by that probability. For example, this is what our trie for java, junior, jupyter, javascript looks like with M = 3:

[Figure: trie]

Then, before starting Beam Search, we find the M best candidates to continue the current word $w_n$ and select the N best of them by $P(w_1 \dots w_n)$.
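A character-level trie with the top M terms cached in every node can be sketched as follows. The unigram probabilities are invented for the four terms from the example, and `CharNode`, `build_char_trie` and `complete_word` are illustrative names.

```python
M = 3  # number of terms kept in every node

# Made-up unigram probabilities for the four terms from the example.
UNIGRAMS = {"java": 0.4, "javascript": 0.3, "junior": 0.2, "jupyter": 0.1}


class CharNode:
    def __init__(self):
        self.children = {}   # character -> CharNode
        self.top_terms = []  # up to M terms below this node, most probable first


def build_char_trie(unigrams, m=M):
    root = CharNode()
    # Insert terms from most to least probable, so every list stays sorted.
    for term in sorted(unigrams, key=unigrams.get, reverse=True):
        node = root
        for char in term:
            if len(node.top_terms) < m:
                node.top_terms.append(term)
            node = node.children.setdefault(char, CharNode())
        if len(node.top_terms) < m:
            node.top_terms.append(term)
    return root


def complete_word(root, prefix):
    """Best completions of an unfinished word, found in O(len(prefix))."""
    node = root
    for char in prefix:
        node = node.children.get(char)
        if node is None:
            return []
    return node.top_terms


char_trie = build_char_trie(UNIGRAMS)
print(complete_word(char_trie, "j"))
print(complete_word(char_trie, "ju"))
```

With M = 3 the node for "j" keeps java, javascript and junior, while jupyter only appears from the node "ju" downwards.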

Typos

Great, we have built a service that gives good hints for a user query. We are even ready for new words. And everything would be fine... But users slip up and do not switch the keyboard hfcrkflre (the Russian word for “layout”, typed in the wrong layout).

How do we solve this? The first thing that comes to mind is to search for corrections among the closest words by Damerau-Levenshtein distance, defined as the minimum number of character insertions, deletions, substitutions, or transpositions of two adjacent characters needed to turn one string into another. Unfortunately, this distance does not take into account the probability of a particular substitution. So, for the typed word “sapper”, the candidates “collector” and “welder” turn out to be equivalent, although intuitively it seems that the user meant the second word.

The second problem is that we do not take into account the context in which the error occurred. For example, in the query “order sapper” we should still prefer the option “collector” rather than “welder”.
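For reference, the distance discussed above can be computed with a small dynamic program. The sketch implements the restricted variant (often called optimal string alignment), which is an assumption on our side since the text does not say which variant was used; `osa_distance` is an illustrative name.

```python
def osa_distance(a, b):
    """Edit distance with insertions, deletions, substitutions
    and transpositions of two adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all characters of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all characters of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


print(osa_distance("javascritp", "javascript"))  # one transposition
print(osa_distance("jva", "java"))               # one insertion
```

Every edit costs exactly 1 here, which is why two candidates at the same distance cannot be told apart without a probabilistic model.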

If we approach typo correction from a probabilistic point of view, we naturally arrive at the noisy channel model:

an alphabet $\Sigma$;
the set of all finite strings $\Sigma^*$ over it;
the set of strings that are correct words, $D \subseteq \Sigma^*$;
given distributions $P(s \mid w)$, where $s \in \Sigma^*, w \in D$.

Then the correction task is stated as finding the correct word w for the input s. Depending on the source of the errors, the measure $P$ can be built in different ways; in our case it makes sense to estimate the probabilities of typos (let's call them elementary replacements) $P_e(t \mid r)$, where t and r are character n-grams, and then evaluate $P(s \mid w)$ as the probability of getting s from w by the most probable elementary replacements.

Let $Part_n(x)$ be the set of partitions of the string $x$ into $n$ substrings (possibly empty). The Brill-Moore model computes the probability $P(s \mid w)$ as follows:

$P(s \mid w) \approx \max_{R \in Part_n(w),\ T \in Part_n(s)} \prod_{i=1}^{n} P_e(T_i \mid R_i)$

But we need to find $P(w \mid s)$:

$P(w \mid s) = \frac{P(s \mid w) P(w)}{P(s)} = const \cdot P(s \mid w) \cdot P(w)$

By learning to evaluate P (w | s), we will also solve the problem of ranking options with the same Damerau-Levenshtein distance and will be able to take into account the context when correcting a typo.
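The ranking that follows from Bayes' rule fits in a few lines. The tables `CHANNEL` and `PRIOR` and their numbers are invented for illustration; in a real system the prior could come from the language model and so depend on the context.

```python
# P(s | w) for a typed string s and a candidate word w (made-up numbers).
CHANNEL = {
    ("jaav", "java"): 0.01,
    ("jaav", "jaws"): 0.02,
}
# P(w): how likely each candidate is on its own (made-up numbers).
PRIOR = {"java": 0.05, "jaws": 0.001}


def rank_corrections(typed, channel_prob, word_prob):
    """Sort candidate words w by P(typed | w) * P(w), best first."""
    scored = [
        (p * word_prob[w], w)
        for (s, w), p in channel_prob.items()
        if s == typed
    ]
    return [w for _, w in sorted(scored, reverse=True)]


print(rank_corrections("jaav", CHANNEL, PRIOR))
```

In this toy example "jaws" has the higher channel probability, but "java" wins once the prior is taken into account; $P(s)$ is the same for all candidates and can be ignored.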

Computing $P_e(T_i \mid R_i)$

User queries help us again in computing the probabilities of elementary substitutions: we collect pairs of words (s, w) such that

they are close in Damerau-Levenshtein distance;
one of the words occurs N times more often than the other.

For such pairs, we compute the optimal Levenshtein alignment.

We then enumerate all possible partitions of s and w (we limited ourselves to lengths n = 2, 3): n → n, pr → rn, pro → rn, ro → po, m → ``, mm → m, etc. For each n-gram, we compute

$P_e(t \mid r) = \frac{count(r \to t)}{count(r)}$
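The counting step can be sketched as follows, assuming the alignments have already been split into fragment pairs. The input format, the sample alignments and the name `estimate_substitutions` are assumptions; building the alignments themselves and merging neighbouring fragments into longer n-grams are left out.

```python
from collections import Counter


def estimate_substitutions(aligned_pairs):
    """Estimate P_e(t | r) = count(r -> t) / count(r).

    aligned_pairs is a list of alignments; each alignment is a list of
    (r, t) fragments, where r comes from the correct word w and t from
    the typed string s.
    """
    pair_counts, source_counts = Counter(), Counter()
    for alignment in aligned_pairs:
        for r, t in alignment:
            pair_counts[(r, t)] += 1
            source_counts[r] += 1
    return {(r, t): c / source_counts[r] for (r, t), c in pair_counts.items()}


# Two hypothetical alignments of the word "java": a typo "jaav" and a correct spelling.
alignments = [
    [("j", "j"), ("a", "a"), ("va", "av")],
    [("j", "j"), ("a", "a"), ("va", "va")],
]
p_e = estimate_substitutions(alignments)
print(p_e[("va", "av")])  # "va" was typed as "av" in one of two cases
```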

Computing $P(s \mid w)$

Computing $P(s \mid w)$ directly takes $O(2^{|w| + |s|})$: we have to go through all possible partitions of $w$ combined with all possible partitions of $s$. However, dynamic programming over prefixes gives the answer in $O(|w| \cdot |s| \cdot n^2)$, where $n$ is the maximum length of an elementary substitution:

$\begin{cases} d[0, j] = 0, & j \ge k \\ d[i, 0] = 0, & i \ge k \\ d[0, j] = P(s[0:j] \mid w[0]), & j < k \\ d[i, 0] = P(s[0] \mid w[0:i]), & i < k \\ d[i, j] = \underset{k, l \le n,\ k < i,\ l < j}{\max} \left( P(s[j-l:j] \mid w[i-k:i]) \cdot d[i-k-1, j-l-1] \right) \end{cases}$

Here $P$ is the probability of the corresponding string in the k-gram model. If you look closely, this is very similar to the Wagner-Fischer algorithm with Ukkonen's cutoff. At every step we obtain $P(s[0:j] \mid w[0:i])$ by going through all replacements of $w[i-k:i]$ with $s[j-l:j]$ for $k, l \le n$ and choosing the most probable one.
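A runnable sketch of this dynamic program is given below. It is a cleaned-up variant rather than a literal transcription of the recurrence: it uses $d[0][0] = 1$ as the only base case and indexes prefixes by length, and the substitution table `P_E` with its probabilities is made up. Keys are pairs (fragment of w, fragment of s).

```python
# Made-up elementary substitution probabilities P_e(t | r), keyed as (r, t).
P_E = {
    ("j", "j"): 0.9, ("a", "a"): 0.9, ("v", "v"): 0.9,
    ("va", "av"): 0.05,               # transposition learned as a 2-gram
    ("v", "a"): 0.001, ("a", "v"): 0.001,
}


def channel_prob(s, w, p_e, n=2):
    """Max-product probability of turning w into s using
    elementary substitutions of length at most n on each side."""
    d = [[0.0] * (len(s) + 1) for _ in range(len(w) + 1)]
    d[0][0] = 1.0  # empty prefix maps to empty prefix
    for i in range(len(w) + 1):
        for j in range(len(s) + 1):
            for k in range(min(i, n) + 1):      # length of the fragment of w
                for l in range(min(j, n) + 1):  # length of the fragment of s
                    if k == 0 and l == 0:
                        continue
                    p = p_e.get((w[i - k:i], s[j - l:j]), 0.0)
                    d[i][j] = max(d[i][j], d[i - k][j - l] * p)
    return d[len(w)][len(s)]


print(channel_prob("jaav", "java", P_E))  # best path uses the "va" -> "av" substitution
print(channel_prob("java", "java", P_E))  # all characters kept as they are
```

The four nested loops give the $O(|w| \cdot |s| \cdot n^2)$ bound mentioned above. For "jaav" the best path is j→j, a→a, va→av with probability 0.9 · 0.9 · 0.05 = 0.0405, which beats the path made of two single-character substitutions.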

Back to $P(w \mid s)$

So, we can compute $P(s \mid w)$. Now we need to select several candidates $w$ that maximize $P(w \mid s)$. More precisely, for the original query $s_1 s_2 \dots s_n$ we must choose $w_1 \dots w_n$ for which $P(w_1 \dots w_n \mid s_1 \dots s_n)$ is maximal. Unfortunately, an exhaustive search over the candidates did not fit into our response time requirements (and the project deadline was approaching), so we settled on the following approach:

from the original query we obtain several variants by changing the last k words:
1. we correct the keyboard layout if the resulting term is several times more probable than the original one;
2. we find words whose Damerau-Levenshtein distance from the original does not exceed d;
3. we select the top-N of them by $P(s \mid w)$;
we pass the variants to Beam Search together with the original query;
when ranking the results, we discount the obtained variants by $\prod_{i=0}^{k-1} P(s_{n-i} \mid w_{n-i})$.

For step 2 we used the FB-Trie algorithm (forward and backward trie), which is based on fuzzy search in forward and reverse prefix trees. This turned out to be faster than evaluating $P(s \mid w)$ over the entire dictionary.
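The layout correction from step 1 can be illustrated with a character translation table. This is a sketch under assumptions: it covers only lowercase letters of the standard QWERTY and ЙЦУКЕН layouts, and the threshold `ratio`, the floor probability for unseen terms and the sample probabilities are made up.

```python
EN = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
RU = "йцукенгшщзхъфывапролджэячсмитьбю"
EN_TO_RU = str.maketrans(EN, RU)
RU_TO_EN = str.maketrans(RU, EN)


def fix_layout(term, unigram_prob, ratio=10.0):
    """Return the term retyped in the other layout if that version
    is at least `ratio` times more probable than the original."""
    floor = 1e-9  # probability assumed for terms we have never seen
    original = unigram_prob.get(term, floor)
    for table in (EN_TO_RU, RU_TO_EN):
        switched = term.translate(table)
        if switched != term and unigram_prob.get(switched, floor) > ratio * original:
            return switched
    return term


# Made-up unigram probabilities.
probs = {"раскладку": 1e-4, "java": 1e-2}
print(fix_layout("hfcrkflre", probs))  # the word from the joke above
print(fix_layout("java", probs))       # already a probable term, left as is
```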

Query Statistics

Building the language model is simple: we collect statistics on user queries (how many times a given phrase was searched, by how many users, and by how many registered users), split the queries into n-grams, and build the tries. The error model is more complicated: at the very least, building it requires a dictionary of correct words. As mentioned above, to select the training pairs we assumed that the words in such a pair should be close in Damerau-Levenshtein distance and that one of them should occur several times more often than the other.

But the data is still too noisy: XSS injection attempts, wrong keyboard layout, random text from the clipboard, experienced users with queries like “programmer c not 1c”, ~~queries from a cat that walked across the keyboard~~.

For example, what was the user trying to find with a query like this?

Therefore, to clean the source data, we excluded:

low-frequency terms;
queries containing query-language operators;
obscene vocabulary.

We also corrected the keyboard layout and checked terms against words from vacancy texts and open dictionaries. Of course, we could not fix everything, but such variants are usually either cut off completely or end up at the bottom of the list.

In prod

Right before the project defense, we launched the service in production for internal testing, and a couple of days later for 20% of users. At hh.ru, all changes that are significant to users go through a system of A/B tests, which lets us not only be sure of the significance and quality of the changes, but also find errors.

[Figure: metric]

The average number of searches made from hints per applicant went up (from 0.959 to 1.1355), and the share of searches made from hints among all search queries increased from 12.78% to 15.04%. Unfortunately, the main product metrics did not grow, but users have definitely started using hints more often.

In the end

There was no room left for a story about the School's processes, the other models we tested, the tools we wrote for comparing models, or the meetings where we decided which features to develop in order to make it to the intermediate demos. Watch the recordings of the previous School, leave an application at https://school.hh.ru, complete interesting tasks, and come study. By the way, the service for checking the tasks was also built by graduates of the previous cohort.

Source: https://habr.com/ru/post/464415/
