Zipf’s distribution in language: Optimal text, frequency parameters of distorted texts and authorship ratio | Vestnik Tomskogo gosudarstvennogo universiteta. Filologiya – Tomsk State University Journal of Philology. 2024. № 90. DOI: 10.17223/19986645/90/1

Zipf’s distribution in language: Optimal text, frequency parameters of distorted texts and authorship ratio

The article focuses on a family of rank distributions, which are of key importance in the linguistic study of texts within quantitative and corpus linguistics. The research examines word frequency behavior in texts written in English: we recorded the discrepancies that inevitably occur between the frequencies calculated by Zipf’s formula and the actually observed frequencies, depending on the size of the text. Thus, not only the size per se but also the reasons behind the existence of an “optimal” text size are investigated; in addition, the influence of the integrity of the text and of its authorship on frequency characteristics is studied experimentally. First, the “optimal” text size was determined in a series of experiments. The existence of an optimal text size was predicted by George Zipf himself; it is the size at which the discrepancy between the calculated (theoretical) frequencies and the actually observed frequencies is minimal. The optimal text size also proves to be a key parameter in investigating distorted texts. Second, the article addresses a number of controversial claims that have been made about incomplete or distorted texts. The frequency characteristics of distorted texts are studied to verify or refute previously proposed hypotheses, in particular those expressed by the Russian mathematician Yury Orlov (1980). A commonly held assumption, often taken for granted, is that the distribution of word frequencies in incomplete texts might disagree with Zipf’s law. We sought empirical evidence for this assumption and explored the relationship between observed frequencies and the completeness, or integrity, of a text.
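The comparison between calculated and observed frequencies described above can be illustrated with a minimal Python sketch. It assumes the classical form of Zipf’s formula, in which the count at rank r equals the top count divided by r; the abstract does not state which variant or which discrepancy measure the authors used, so `zipf_counts` and `mean_relative_discrepancy` are hypothetical helpers chosen for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ranked_counts(tokens):
    """Return observed word counts sorted from most to least frequent."""
    return sorted(Counter(tokens).values(), reverse=True)


def zipf_counts(top_count, n_ranks):
    """Assumed classical Zipf prediction: count at rank r = top_count / r."""
    return [top_count / r for r in range(1, n_ranks + 1)]


def mean_relative_discrepancy(observed, predicted):
    """Hypothetical measure: mean of |observed - predicted| / predicted."""
    return sum(abs(o - p) / p for o, p in zip(observed, predicted)) / len(observed)


# Toy input; a real experiment would use texts of many thousands of words.
text = "the cat and the dog and the bird saw the cat"
observed = ranked_counts(tokenize(text))
predicted = zipf_counts(observed[0], len(observed))
score = mean_relative_discrepancy(observed, predicted)
print(observed, round(score, 4))
```

Under these assumptions, repeating the calculation for texts of increasing size and recording where `score` is smallest would give one rough way to look for an “optimal” size.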
Contrary to expectations, the results show that only the size of the text is crucial: the distribution remains Zipfian even for fragments of the text and for words randomly selected from it, provided that together they make up a text of the optimal size. Finally, the study finds that authorship can be investigated through frequency-derived measures. The so-called author’s ratio, defined as the relative frequency of the most frequent word, proves insensitive to whether the texts of a given author are incomplete, distorted or even fragmentary. It remains remarkably stable across both complete texts and random fragments, such as sentences, written by the same author. The authors declare no conflicts of interest.
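The author’s ratio defined in the abstract, the relative frequency of the most frequent word, can be computed with a short Python sketch. The random word sample stands in for a “distorted” text; the function names, the sample size and the seed are illustrative assumptions, not the authors’ procedure.

```python
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def authors_ratio(tokens):
    """Relative frequency of the most frequent word: top count / token count."""
    top_count = Counter(tokens).most_common(1)[0][1]
    return top_count / len(tokens)


def random_word_sample(tokens, size, seed=0):
    """Draw `size` tokens at random without replacement (assumed distortion)."""
    return random.Random(seed).sample(tokens, size)


# Toy input; far too short to show the stability reported in the abstract.
text = "the cat and the dog and the bird saw the cat"
tokens = tokenize(text)
full_ratio = authors_ratio(tokens)
sample_ratio = authors_ratio(random_word_sample(tokens, 6))
print(full_ratio, sample_ratio)
```

With realistic text sizes, comparing `full_ratio` with `sample_ratio` over many samples would be one way to check whether the ratio stays stable.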

Keywords

rank distributions, theoretical and observed frequencies, optimal text size, frequency characteristics of deformed texts, author’s ratio

Authors

Name | Organization | E-mail
Gorina Olga G. | National Research University Higher School of Economics | gorina@bk.ru
Tsarakova Natalya S. | National Research University Higher School of Economics | n.carakova@gmail.com
Kraytorov Michael V. | National Research University Higher School of Economics | mvkraytorov@edu.hse.ru
Kuganova Daria A. | National Research University Higher School of Economics | dakuganova@edu.hse.ru
Petrov Igor D. | National Research University Higher School of Economics | idpetrov@edu.hse.ru
Total: 5

References

Zipf G.K. Human Behavior and the Principle of Least Effort // Science 110. 1949. № 2868. P. 669.
Arapov M.V., Efimova E.N., Shreider Yu.A. On the meaning of rank distributions // NTI. Ser. 2. 1975. № 1. P. 9-20. (In Russian).
Bowen Cai, Zhenfeng Shao, Shenghui Fang, Xiao Huang, Yun Tang, Muchen Zheng, Hao Zhang. The Evolution of urban agglomerations in China and how it deviates from Zipf’s law // Geo-spatial Information Science. 2022.
Zanette D., Montemurro M. Dynamics of Text Generation with Realistic Zipf’s Distribution // Journal of Quantitative Linguistics. 2005. № 12:1. P. 29-40.
Merton R.K. The Matthew Effect in Science // Science. 1968b. Vol. 5, № 159. P. 3810. Reprinted in: Merton, 1973.
Petrov V.M., Yablonsky A.I. Mathematics and social processes: hyperbolic distributions and their application. Moscow: Znanie, 1980. 64 p. (In Russian).
Estoup J.B. Gammes stenographiques: methode et exercices pour l'acquisition de la vitesse. 4e ed. rev. et aug. Paris, 1916. 151 p.
Zipf G.K. Relative frequency as a determinant of phonetic change // Harvard studies in classical philology. 1929. № 40.
O'Keeffe A., McCarthy M., Carter R. From corpus to classroom: Language use and language teaching. Cambridge: Cambridge University Press, 2007.
Scott M., Tribble C. Textual Patterns: key words and corpus analysis in language education: Studies in Corpus Linguistics. Amsterdam/Philadelphia: John Benjamins, 2006. 200 p.
Gorina O.G., Tsarakova N.S., Tsarakov S.K. Study of Optimal Text Size Phenomenon in Zipf-Mandelbrot's Distribution on the Bases of Full and Distorted Texts. Author's Frequency Characteristics and derivation of Hapax Legomena // Journal of Quantitative Linguistics. 2020. № 27:2. P. 134-158.
Gorina O.G., Tsarakova N.S. Corpus tools, routes and experiments in modern linguodidactics // Vestnik NGU. Seriya: Lingvistika i mezhkul'turnaya kommunikatsiya. 2021. Vol. 19, № 2. P. 36-53. (In Russian).
Mandelbrot B.B. The Fractal Geometry of Nature. New York: Freeman, 1983.
Scott M. Wordsmith Tools: Software. Oxford: Oxford University Press, 2012.
Orlov Yu.K. Invisible harmony // Chislo i mysl' [Number and Thought]. Issue 3. Moscow: Znanie, 1980. P. 70-106. (In Russian).