Huge open dataset of Russian speech, version 1.0



Early this year, for a number of reasons, we set out to create the largest open dataset of Russian speech. You can read more about our motivation and how it all began in the article "A huge open dataset of Russian speech". Since then the project has changed substantially: we have tripled the amount of data, improved its quality, added speaker labels, and are now finally ready to present version 1.0.


We are not going to rest on our laurels: in future versions we plan to keep working intensively on fixing errors and improving the quality of the data already published. Version 1.1 will be devoted to large-scale error correction.


Open STT v1.0 in brief



Domain            Annotation        Phrases   Hours    GB
Radio             Alignment         8.3M      11,996   1,367
Public speaking   Alignment         1.7M      2,709    301
YouTube           Subtitles         2.6M      2,117    346
Books             Alignment / ASR   1.3M      1,632    180
Calls             ASR               695K      819      91
Other datasets    TTS, recitation   1.9M      835      95

More detailed statistics can be found in the project repository.
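
For a quick sense of scale, the per-domain figures from the table can be totaled and turned into a rough average phrase length. The snippet below is only a sketch: the names `ROWS`, `totals` and `mean_phrase_seconds` are made up for illustration and are not part of the project, and because the phrase counts in the table are rounded (8.3M, 695K and so on), the results are approximate.

```python
# Rows copied from the table: (domain, phrases, hours, size in GB).
# Phrase counts are the rounded values shown in the table, so every
# result derived from them is an approximation.
ROWS = [
    ("Radio", 8_300_000, 11_996, 1367),
    ("Public speaking", 1_700_000, 2_709, 301),
    ("YouTube", 2_600_000, 2_117, 346),
    ("Books", 1_300_000, 1_632, 180),
    ("Calls", 695_000, 819, 91),
    ("Other datasets", 1_900_000, 835, 95),
]


def totals(rows):
    """Return (phrases, hours, gigabytes) summed over all domains."""
    phrases = sum(row[1] for row in rows)
    hours = sum(row[2] for row in rows)
    gigabytes = sum(row[3] for row in rows)
    return phrases, hours, gigabytes


def mean_phrase_seconds(phrases, hours):
    """Average phrase duration in seconds for the given counts."""
    return hours * 3600 / phrases


if __name__ == "__main__":
    total_phrases, total_hours, total_gb = totals(ROWS)
    print(f"Total: ~{total_phrases / 1e6:.1f}M phrases, "
          f"{total_hours:,} hours, {total_gb:,} GB")
    for domain, phrases, hours, _ in ROWS:
        seconds = mean_phrase_seconds(phrases, hours)
        print(f"{domain:<16} ~{seconds:.1f} s per phrase")
```

The same approach works on the more detailed statistics in the repository if they are first loaded into a list of rows of this shape.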



We have made every effort to improve the quality of the annotation:



What tasks can the dataset be useful for?



How do you plan to develop the dataset in the future?



You can learn more about new domains in the repository.



Source: https://habr.com/ru/post/474462/

