Speech recognition is not yet solved

Since deep learning entered the speech recognition scene, word error rates have dropped dramatically. But despite what you may have read, we still do not have human-level speech recognition. Speech recognizers fail in many ways. To keep improving, we need to identify these failure modes and work on fixing them. This is the only way to move from recognition that works for some people most of the time to recognition that works for all people all of the time.

Improvements in word error rate on the Switchboard conversational speech benchmark. The test set was collected in 2000 and consists of 40 telephone conversations, each between two random native English speakers.

Claiming human-level conversational speech recognition based only on the Switchboard telephone conversations is like claiming that a self-driving car drives as well as a human after testing it in a single city on a sunny day with no traffic. The recent progress in speech recognition is impressive, but claims of human-level performance are too broad. Here are some areas that still need improvement.

Accents and noise

One of the most visible shortcomings of speech recognition is the handling of accents and background noise. The main reason is that most of the training data consists of American-accented English with a high signal-to-noise ratio. For example, the Switchboard telephone conversations contain only native English speakers (mostly American) with little background noise.

But more training data alone will probably not solve this problem. Many languages have many dialects and accents, and it is unrealistic to collect labeled data for every case. Building a high-quality speech recognizer for American-accented English alone requires upwards of 5,000 hours of transcribed audio.

Comparison of human transcribers with Baidu's Deep Speech 2 on different types of speech. Humans are worse at recognizing non-American accents, perhaps because most of the transcribers are American. I would expect transcribers who grew up in a given region to recognize that region's accent with a much lower error rate.

With background noise, such as in a moving car, the signal-to-noise ratio can be as low as -5 dB. Humans have little trouble understanding one another in such conditions. Automatic recognizers, on the other hand, degrade much faster as noise increases. The graph shows how the gap between human and model error rates widens sharply as noise increases (at low SNR, that is, signal-to-noise ratio, values).
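To give a feel for what -5 dB means, here is a minimal sketch of the standard decibel definition of SNR in terms of average power. The function names are made up for this illustration and do not come from any particular toolkit.

```python
import math


def snr_db(signal_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in decibels, computed from average powers."""
    return 10.0 * math.log10(signal_power / noise_power)


def noise_to_signal_power_ratio(snr_in_db: float) -> float:
    """How many times larger the noise power is than the signal power."""
    return 10.0 ** (-snr_in_db / 10.0)


# At -5 dB the noise carries roughly three times the power of the speech.
print(round(noise_to_signal_power_ratio(-5.0), 2))
# Equal signal and noise power corresponds to 0 dB.
print(snr_db(1.0, 1.0))
```

In other words, a negative SNR means the noise is louder than the speech itself, which is the regime where the gap between humans and models is largest.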

Semantic errors

Often the word error rate is not the actual objective of a speech recognition system. What we really care about is the semantic error rate: the proportion of utterances whose meaning we get wrong.

An example of a semantic error is when someone says “let's meet up Tuesday” and the recognizer outputs “let's meet up today”. Word errors can also occur without semantic errors: if the recognizer drops the “up” and outputs “let's meet Tuesday”, the meaning of the sentence is unchanged.
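The two cases above can be checked numerically. The following is a minimal sketch of word error rate computed as word-level edit distance divided by the reference length, which is the usual definition. The function name and the lowercasing and whitespace tokenization are simplifying assumptions for this illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "let's meet up Tuesday"
# Semantic error: the day changes.
print(word_error_rate(reference, "let's meet up today"))
# No semantic error: only the filler word is dropped.
print(word_error_rate(reference, "let's meet Tuesday"))
```

Both hypotheses contain exactly one word error out of four reference words, so they get the same word error rate even though only the first one changes the meaning.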

We have to be careful when using word error rate as a proxy. To illustrate, consider the worst case. A word error rate of 5% corresponds to roughly one wrong word in every 20. If each sentence has 20 words (about average for English), the sentence error rate could be as high as 100%. One can hope that the misrecognized words do not change the meaning of the sentences. Otherwise the recognizer could misinterpret every sentence even at a 5% word error rate.
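The worst-case arithmetic can be written out as a short sketch. The worst case assumes the errors are spread so that every sentence receives at most one. For comparison, the second function assumes word errors are independent of each other, which is an extra assumption made only for this illustration and not a claim from the text.

```python
def worst_case_sentence_error_rate(wer: float, words_per_sentence: int) -> float:
    """Upper bound: errors spread out so each one lands in a different sentence."""
    return min(1.0, wer * words_per_sentence)


def independent_sentence_error_rate(wer: float, words_per_sentence: int) -> float:
    """Chance that a sentence has at least one error if word errors are independent."""
    return 1.0 - (1.0 - wer) ** words_per_sentence


print(worst_case_sentence_error_rate(0.05, 20))
print(round(independent_sentence_error_rate(0.05, 20), 3))
```

Even under the milder independence assumption, well over half of the 20-word sentences would contain at least one wrong word at a 5% word error rate.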

When comparing models with humans, it is important to look at the nature of the mistakes and not only at the word error rate. In my experience, human transcribers make fewer errors, and less drastic ones, than computers do.

Researchers from Microsoft recently compared the errors made by humans and by a recognizer with a similar error rate. One difference they found is that the model confuses “uh” with “uh huh” much more often than humans do. The two terms have very different semantics: “uh” just fills a pause, whereas “uh huh” is an acknowledgment from the listener. Apart from that, the model and the humans made many errors of the same types.

Many voices in one channel

The Switchboard telephone conversations are also an easier task because each speaker was recorded on a separate microphone, so multiple voices never overlap in the same audio channel. Humans, by contrast, can understand several speakers, sometimes even when they talk at the same time.

A good conversational speech recognizer must be able to segment the audio stream by speaker (diarization). It must also make sense of audio with overlapping voices (source separation). This should be possible without a microphone placed close to the mouth of each speaker, so that the recognizer works well wherever it happens to be placed.

Recording quality

Accents and background noise are just two of the factors a speech recognizer must be robust to. Here are a few more:

• Reverberation in different acoustic conditions.
• Equipment artifacts.
• Artifacts of the codec used to record and compress the signal.
• Sampling frequency.
• Age of the speaker.

Most people cannot tell an mp3 recording from a wav recording by ear. Before claiming human-level performance, recognizers must become robust to these sources of variation as well.


Context

You may notice that the human error rate on the Switchboard telephone benchmark is quite high. If you were talking with a friend who misunderstood 1 word out of every 20, you would have a hard time communicating.

One reason for this is that the evaluation is done without context. In real life we use many other cues to help us understand what someone is saying. Some examples of context that people use and speech recognizers ignore:

• Conversation history and topic under discussion.
• Visual cues about the speaker, such as facial expressions and lip movement.
• The body of knowledge about the person we are talking to.

The speech recognizer in Android already has access to your contact list, so it can recognize the names of your friends. Voice search in maps uses geolocation to narrow down the possible destinations you might want to navigate to.

The accuracy of recognition systems improves when such signals are incorporated. But we have only begun to explore the kinds of context we could include and the ways of using it.
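One simple way to picture the use of such context is to rescore the recognizer's candidate transcripts with a bonus for words that appear in the user's contact list. This is only a toy sketch under assumed inputs: the n-best list, the scores, the bonus value, and the function name are all hypothetical, and this is not a description of how Android or any real product does it.

```python
from typing import Iterable, List, Tuple


def rescore_with_contacts(
    n_best: List[Tuple[str, float]],
    contacts: Iterable[str],
    bonus: float = 2.0,
) -> str:
    """Pick the best transcript after boosting hypotheses that mention a contact.

    n_best holds (transcript, score) pairs where a higher score is better.
    """
    names = {name.lower() for name in contacts}
    rescored = []
    for text, score in n_best:
        hits = sum(1 for word in text.lower().split() if word in names)
        rescored.append((text, score + bonus * hits))
    return max(rescored, key=lambda pair: pair[1])[0]


# Hypothetical acoustically similar candidates with made-up log-scores.
n_best = [("call ann", -4.1), ("call anne", -4.3), ("call an", -4.0)]
print(rescore_with_contacts(n_best, contacts=["Anne", "Bob"]))  # call anne
print(rescore_with_contacts(n_best, contacts=[]))               # call an
```

Without the contact list the slightly higher-scoring “call an” wins; with it, the hypothesis that matches a known name is preferred.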


Deployment

Recent advances in conversational speech recognition are not yet deployable. When deploying a speech recognition algorithm, you need to think about latency (delay) and computing power. The two are related, because algorithms that need more compute tend to increase latency. For simplicity, we discuss them separately.

Latency: the time from when the user finishes speaking to when the transcription is complete. Low latency is a common requirement for recognition, and it strongly affects the user's experience of the product. Limits in the tens of milliseconds are common. This may seem too strict, but remember that producing the transcript is usually the first step in a series of expensive computations. For example, in voice search the actual web search still has to be performed after the speech has been recognized.

Bidirectional recurrent layers are a typical example of an improvement that hurts latency. All of the latest state-of-the-art results in conversational speech use them. The problem is that nothing after the first bidirectional layer can be computed until the user has finished speaking, so the latency grows with the length of the utterance.

Left: forward-only recurrence allows decoding to begin immediately. Right: bidirectional recurrence requires waiting for the end of speech before decoding.
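The difference between the two kinds of recurrence can be shown with a toy scalar recurrent layer. This is a minimal sketch with made-up weights and inputs, not a real acoustic model: it only demonstrates that a forward-only state at step t depends on the inputs up to t, while a bidirectional state at step t also depends on everything that comes later.

```python
import math
from typing import List, Tuple


def forward_states(inputs: List[float], w_in: float = 0.5, w_rec: float = 0.9) -> List[float]:
    """Toy recurrent layer: the state at step t depends only on inputs[0..t]."""
    state, states = 0.0, []
    for x in inputs:
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states


def bidirectional_states(inputs: List[float]) -> List[Tuple[float, float]]:
    """Pair each forward state with the state of a pass over the reversed input."""
    fwd = forward_states(inputs)
    bwd = forward_states(inputs[::-1])[::-1]
    return list(zip(fwd, bwd))


audio = [0.2, -0.1, 0.4, 0.3]
changed_future = audio[:2] + [0.9, -0.7]  # same first two frames, different ending

# Forward-only: the first two outputs are already final after two frames.
print(forward_states(audio)[:2] == forward_states(changed_future)[:2])
# Bidirectional: even the first output changes when later frames change.
print(bidirectional_states(audio)[0] == bidirectional_states(changed_future)[0])
```

Because the backward pass starts from the last frame, none of the bidirectional outputs can be finalized until the whole utterance has been received.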

Finding a good way to efficiently incorporate future information into speech recognition is still an open problem.

Computational power: the compute needed to transcribe an utterance is an economic constraint. We have to weigh the cost against the benefit of every accuracy improvement. If an improvement does not meet the economic threshold, it cannot be deployed.

A classic example of a consistent improvement that never gets deployed is ensembling. A 1-2% reduction in error rarely justifies a 2-8x increase in compute. Modern recurrent-network language models usually fall into this category as well, since they are very expensive to use in a beam search, although I expect this to change in the future.

To be clear, I am not saying that improving accuracy at a large computational cost is useless. We have already seen the pattern of “first slow but accurate, then fast” work well in the past. The point is that until an improvement is fast enough, it cannot be used.

Over the next five years

There are still many open and challenging problems in speech recognition. Among them:

• Extending recognition to new domains, accents, and speech in heavy background noise.
• Inclusion of context in the recognition process.
• Diarization and separation of sources.
• Semantic error rates and innovative methods for evaluating recognizers.
• Very low latency.

I look forward to the progress that will be made in the next five years on these and other fronts.

Source: https://habr.com/ru/post/408017/
