The Python library ruword_frequency returns the frequency (ipm, items per million) of Russian words. Lookups are case-insensitive.
It is based on a large collection of Russian documents and prepared word-frequency sources. Full list:
- Wikipedia dump, Russian segment
- Flibusta dump, more than 200 GB of texts
- Pyhlyi's library
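- Новый частотный словарь русской лексики (New Frequency Dictionary of Russian Vocabulary)
- Словарь русской литературы (Dictionary of Russian Literature) from http://speakrus.ru/dict/index.htm
- Частотный словарь Марка фон Хагена (Mark von Hagen's frequency dictionary), see description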
The ipm of each word was extracted from every source listed above, and the mean value is used. The full index contains more than 7 billion word forms, which unfortunately include mistakes from the raw data sources.
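The averaging step can be illustrated with a minimal sketch. It assumes each source is a plain mapping from word to raw count, that ipm is computed as count / total * 1,000,000, and that a word missing from a source contributes 0 to the mean. The library's actual aggregation may differ. The names `to_ipm` and `mean_ipm` are hypothetical and not part of ruword_frequency.

```python
from collections import defaultdict


def to_ipm(counts):
    """Convert raw word counts into items per million (ipm), case-insensitively."""
    merged = defaultdict(int)
    for word, count in counts.items():
        merged[word.lower()] += count  # fold case so 'Привет' and 'привет' merge
    total = sum(merged.values())
    if total == 0:
        return {}
    return {word: count * 1_000_000 / total for word, count in merged.items()}


def mean_ipm(sources):
    """Average per-source ipm values; assumption: a missing word counts as 0."""
    if not sources:
        return {}
    sums = defaultdict(float)
    for ipm_table in sources:
        for word, ipm in ipm_table.items():
            sums[word] += ipm
    return {word: total / len(sources) for word, total in sums.items()}


# Two toy sources (made-up counts, for illustration only).
source_a = to_ipm({"Привет": 1, "привет": 1, "мир": 2})
source_b = to_ipm({"привет": 1, "и": 3})
combined = mean_ipm([source_a, source_b])
```

Averaging over all sources, rather than only those that contain the word, lowers the score of words seen in a single source. That is one possible design choice, not necessarily the one used here.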
- Python 3
- The word index occupies about 50 MB on disk and is downloaded the first time you invoke the frequency.load() method
pip install ruword_frequency
from ruword_frequency import Frequency
freq = Frequency()
freq.load()
freq.ipm('привет')
>>> 53.51823806762695
freq.ipm('неттакогослова')
>>> 0.0
# get the maximum ipm value, e.g. for weight normalization
freq.max_ipm()
>>> 42329.2890625
# iterate over the most used words, with ipm greater than 10000
for w in freq.iterate_words(10000):
    print(w)
For other useful methods, see the marisa-trie documentation.
The tree index is available as freq.tree.
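The maximum returned by max_ipm() can be used to scale frequencies into weights, as the usage comment suggests. Below is a minimal sketch, assuming a linear scale into [0, 1] plus an optional logarithmic variant that spreads out rare words. `linear_weight` and `log_weight` are hypothetical helpers, not part of ruword_frequency, and the numeric inputs are the values shown in the usage example.

```python
import math


def linear_weight(ipm, max_ipm):
    """Scale an ipm value into [0, 1] relative to the most frequent word."""
    if max_ipm <= 0:
        raise ValueError("max_ipm must be positive")
    return min(max(ipm / max_ipm, 0.0), 1.0)


def log_weight(ipm, max_ipm):
    """Log-scaled variant: log1p compresses the gap between common and rare words."""
    if max_ipm <= 0:
        raise ValueError("max_ipm must be positive")
    return min(max(math.log1p(ipm) / math.log1p(max_ipm), 0.0), 1.0)


# Values taken from the usage example: ipm('привет') and max_ipm().
MAX_IPM = 42329.2890625
w_known = linear_weight(53.51823806762695, MAX_IPM)
w_unknown = linear_weight(0.0, MAX_IPM)  # unknown words have ipm 0.0
w_log = log_weight(53.51823806762695, MAX_IPM)
```

With the index loaded, the same helpers would be called as linear_weight(freq.ipm(word), freq.max_ipm()).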
from ruword_frequency.source_reader import SourceReader
reader = SourceReader()
# increase the socket timeout, which sometimes helps when downloading huge files:
import socket
socket.setdefaulttimeout(60)
reader.download_all_sources()
tree = reader.build_tree_from_dictionaries()
reader.save_tree(tree)
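# use it (Frequency is imported from ruword_frequency, as in the usage example)
freq = Frequency()
freq.load()  # load the index before querying, as in the usage example
freq.ipm('привет')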