Zipf’s distribution in language: Optimal text, frequency parameters of distorted texts and authorship ratio
The article focuses on a family of rank distributions, which are of key importance in the study of texts within quantitative and corpus linguistics. The research examines word frequency behavior in texts written in English: we recorded the discrepancies that inevitably arise between the frequencies calculated by Zipf’s formula and the actually observed frequencies, depending on the size of the text. Thus, not only the size per se but also the reasons behind the existence of an “optimal” text size are investigated; in addition, the influence of the integrity of the text and its authorship on frequency characteristics is studied experimentally. First, the “optimal” text size was determined in a series of experiments. The existence of an optimal text size was predicted by George Zipf himself; at this size, the discrepancy between the calculated (theoretical) and the actually observed frequencies is minimal. Moreover, the optimal text size proves to be a key parameter in investigating distorted texts. Second, the article addresses a number of controversial claims previously made about incomplete or distorted texts. The frequency characteristics of distorted texts are studied to verify or refute previously proposed hypotheses, in particular those of the Russian mathematician Yury Orlov (1980). One commonly held assumption, usually taken for granted, is that the distribution of word frequencies in incomplete texts might disagree with Zipf’s law. We sought empirical evidence for this assumption and thoroughly explored the correlation between observed frequencies and the completeness of the text.
Contrary to expectations, the results show that only the size of the text is crucial: the distribution remains Zipfian even for fragments of the text and for words randomly selected from it, provided that together they make up a text of the optimal size. Finally, the study finds that authorship can be investigated with frequency derivatives. The so-called author’s ratio, defined as the relative frequency of the most frequent word, proves insensitive to whether the texts of a given author are incomplete, distorted or even fragmentary. It remains remarkably stable across both complete texts and random fragments, such as sentences, written by a given author. The authors declare no conflicts of interest.
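The quantities named in the abstract can be illustrated with a short sketch. It is not the authors' procedure: the abstract does not give the exact form of Zipf’s formula or the discrepancy measure used, so the code assumes the classic form f(r) = f(1) / r and a mean relative deviation. The function names and the simple tokenizer are hypothetical. Only the author’s ratio follows the definition stated above (relative frequency of the most frequent word).

```python
import re
from collections import Counter


def rank_frequencies(text):
    """Return observed word counts ordered from rank 1 (most frequent) down."""
    # Assumption: a crude tokenizer that keeps lowercase letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return sorted(Counter(words).values(), reverse=True)


def zipf_expected(freqs):
    """Theoretical frequencies under the assumed classic form f(r) = f(1) / r."""
    top = freqs[0]
    return [top / rank for rank in range(1, len(freqs) + 1)]


def mean_relative_discrepancy(freqs):
    """Average of |observed - expected| / expected over all ranks (assumed measure)."""
    expected = zipf_expected(freqs)
    return sum(abs(o - e) / e for o, e in zip(freqs, expected)) / len(freqs)


def author_ratio(freqs):
    """Relative frequency of the most frequent word, as defined in the abstract."""
    return freqs[0] / sum(freqs)


if __name__ == "__main__":
    sample = "the cat saw the dog and the dog saw the cat"
    freqs = rank_frequencies(sample)
    print("observed:", freqs)
    print("discrepancy:", mean_relative_discrepancy(freqs))
    print("author's ratio:", author_ratio(freqs))
```

Under these assumptions, the search for an “optimal” size could be approximated by computing the discrepancy for texts of increasing length and noting where it is smallest. The stability claim could be checked by comparing the author’s ratio across a complete text and its fragments.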
Keywords
rank distributions, theoretical and observed frequencies, optimal text size, frequency characteristics of deformed texts, author’s ratio
Authors
Name | Organization | E-mail |
Gorina Olga G. | National Research University Higher School of Economics | gorina@bk.ru |
Tsarakova Natalya S. | National Research University Higher School of Economics | n.carakova@gmail.com |
Kraytorov Michael V. | National Research University Higher School of Economics | mvkraytorov@edu.hse.ru |
Kuganova Daria A. | National Research University Higher School of Economics | dakuganova@edu.hse.ru |
Petrov Igor D. | National Research University Higher School of Economics | idpetrov@edu.hse.ru |
References

Zipf’s distribution in language: Optimal text, frequency parameters of distorted texts and authorship ratio | Vestnik Tomskogo gosudarstvennogo universiteta. Filologiya – Tomsk State University Journal of Philology. 2024. № 90. DOI: 10.17223/19986645/90/1