At the beginning of this year, for a number of reasons, we set out to create the largest open dataset of Russian speech. You can read more about our motivation and how it all began in the article "A huge open dataset of Russian speech". Since then the project has gone through a series of large-scale changes: we have tripled the amount of data, improved its quality, added speaker labels, and are now finally ready to present version 1.0.
We do not intend to rest on our laurels either: in future versions we plan to keep working intensively on fixing errors and improving the quality of the data already published. We plan to devote version 1.1 to large-scale bug fixing.
Open STT v1.0 in brief
- More than 20,000 hours of Russian speech audio (our initial target was 10,000 hours), 2.3 TB of data in .wav format (less in .mp3, of course);
- A wide variety of domains, ranging from audio recorded on a professional microphone to phone calls;
More detailed statistics can be found in the project repository.
- The data can now be downloaded at high speed either in .wav format (mono, 16 kHz, int16) via torrent or in .mp3 format via a direct link;
- Added a small manually labeled validation dataset (18 hours) covering the 3 main domains;
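As a quick sanity check on the figures above, the stated .wav format (mono, 16 kHz, int16) can be turned into a size estimate and a format check. This is only a sketch using Python's standard `wave` module; the function names are made up for illustration and are not part of the project's tooling.

```python
import io
import wave

SAMPLE_RATE_HZ = 16_000   # 16 kHz, as stated for the .wav release
SAMPLE_WIDTH_BYTES = 2    # int16 means 2 bytes per sample
CHANNELS = 1              # mono


def estimated_wav_bytes(hours):
    """Raw PCM payload size for the given duration, ignoring WAV headers."""
    seconds = hours * 3600
    return int(seconds * SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES * CHANNELS)


def matches_expected_format(wav_file):
    """Return True if a WAV file (path or file object) is mono, 16 kHz, 16-bit."""
    with wave.open(wav_file, "rb") as wav:
        return (
            wav.getnchannels() == CHANNELS
            and wav.getframerate() == SAMPLE_RATE_HZ
            and wav.getsampwidth() == SAMPLE_WIDTH_BYTES
        )


def make_silent_wav(seconds=0.1):
    """Build a tiny in-memory WAV file so the check can be tried without data."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(b"\x00\x00" * int(seconds * SAMPLE_RATE_HZ))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # 20,000 h * 3600 s/h * 16,000 samples/s * 2 bytes = 2.304e12 bytes
    print(estimated_wav_bytes(20_000) / 1e12)  # 2.304, i.e. about 2.3 TB
    print(matches_expected_format(make_silent_wav()))  # True
```

The estimate agrees with the 2.3 TB quoted for 20,000 hours, which suggests the figure refers to uncompressed 16-bit mono audio at 16 kHz.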
We have made every effort to improve the quality of the annotation:
- Improved the model used to align new domains;
- Used better, more finely tuned STT models for alignment;
- Improved the algorithm for normalizing numbers and Latin letters;
- Gradually re-annotated or removed the "dirty" data from previous versions;
- Fixed a number of the dataset's teething problems, such as:
  - Dangling single letters at the beginning and end of sentences;
  - Low alignment yield caused by low-quality models;
  - "Correct" handling of punctuation marks during alignment;
- (Soon!) Real speaker labels will appear;
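For readers who want to spot the "dangling single letters" artifact in their own transcripts, a simple heuristic might look like the sketch below. It is not the method used to clean the dataset; the function name and the allowlist of legitimate one-letter Russian words are assumptions made for illustration, and the check is meant for flagging utterances for review rather than for automatic removal.

```python
# Hypothetical allowlist: common Russian words that really are one letter long.
ONE_LETTER_WORDS = {"а", "в", "и", "к", "о", "с", "у", "я"}


def has_dangling_letter(transcript):
    """Flag a transcript whose first or last token is a suspicious single letter."""
    tokens = transcript.lower().split()
    if len(tokens) < 2:
        # A one-token utterance has no "edge" that could have been cut off.
        return False
    for token in (tokens[0], tokens[-1]):
        if len(token) == 1 and token.isalpha() and token not in ONE_LETTER_WORDS:
            return True
    return False


if __name__ == "__main__":
    print(has_dangling_letter("т привет как дела"))  # True: stray leading letter
    print(has_dangling_letter("я пошел домой"))      # False: a real one-letter word
```

Because the allowlist lets real words through, a stray letter that happens to match one of them will be missed, so the heuristic trades recall for fewer false alarms.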
What tasks can the dataset be useful for?
- Speech recognition;
- Speech synthesis;
- Denoising (removing noise from audio);
- Voice identification;
- Separation of speakers;
How do you plan to develop the dataset in the future?
- Improve and re-upload existing datasets, clean up the annotation;
- Publish models for speech recognition and postprocessing;
- Add speaker ID annotation. For some of the new domains this annotation is already available, and there is also the idea of adding speakers to the old datasets;
- Possibly expand to other languages;
- Possibly add several new domains;
You can learn more about new domains in the repository.