The History of Corpus Linguistics (On the Example of the English Language Corpora)
The aim of the research is to review the milestones in the development of corpus linguistics and to present an original classification of the main periods in the formation and development of English-language corpora. The classification comprises four periods: (a) the “pre-electronic” period, or the period of text archives, which lasted for several centuries and ended in the 1960s; (b) the “first generation” period, from the 1960s to the mid-1990s; (c) the “second generation” period of megacorpora, which corresponds to the last decade of the 20th century; (d) the “third generation” period of gigacorpora, which started in the mid-2000s. The pre-electronic corpora and concordances lacked a unified system of text collection and shared views on the representative size and sources of corpora. The basic principles of concordance compilation, the KWIC (keyword in context) system, and lemmatization were developed in this period. The first generation corpora were mostly compiled for the study of certain genres and/or the speech of certain groups of people. These corpora typically contained texts with a limited number of tokens, usually no more than 2,000. Among the most significant achievements of that period are the Brown Corpus and the Lancaster-Oslo/Bergen (LOB) Corpus, the first reference corpora, which were used for lexical and grammatical studies of “language in use”; the first concordance software (CLOC, COCOA); and the first automatic tagging software (TAGGIT). By the early 1990s, the following terms had been introduced, specified, and defined: “corpus linguistics”, “metatext”, “tagging”, “concordancer”, “POS-tagging”, “tokenization”, “segmentation”, “parsing”. The problems of corpus standardization, compilation, and tagging were addressed in the Text Encoding Initiative project (1987). The annotation schemes of that period began to require POS, syntactic, semantic, and other tagging. Concordancers of the mid-2000s became faster and more user-friendly.
Representativeness in corpora was achieved through the inclusion of spoken and written texts from various communicative events. Accordingly, the reference corpora of the second generation (BNC, ANC) represent the national language through a wide range of written and spoken genres in many regional dialects. The third generation corpora, or gigacorpora (COCA, Google Books), grew to several billion tokens and became dynamic. Their built-in software makes it possible to track the form, meaning, and use of words and n-grams in written and spoken texts in a number of languages across several historical periods. Modern concordancers also serve as tools for compiling small subcorpora and comparing the results obtained from them with those from larger corpora (BNC, COCA).
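Two of the techniques named in the abstract, the KWIC display and n-gram tracking, can be illustrated with a minimal Python sketch. It is not the software of any corpus mentioned above: the function names `tokenize`, `kwic`, and `ngram_counts`, the naive tokenization rule, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately naive rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def kwic(tokens, keyword, width=3):
    """Return one (left context, keyword, right context) triple per occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


def ngram_counts(tokens, n=2):
    """Count contiguous n-grams, each represented as a tuple of tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Hypothetical two-sentence "corpus" used only to exercise the functions.
sample = "The cat sat on the mat. The dog sat on the log."
tokens = tokenize(sample)

# Print the keyword aligned in a central column, as in a KWIC display.
for left, key, right in kwic(tokens, "sat", width=2):
    print(f"{left:>12} | {key} | {right}")

# List the two most frequent bigrams with their counts.
for gram, count in ngram_counts(tokens, 2).most_common(2):
    print(" ".join(gram), count)
```

A real concordancer would add lemmatization, annotation-aware search, and indexing for speed. The sketch only shows why a keyword aligned with its left and right context makes recurrent patterns easy to see.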
Keywords
history of linguistics,
text corpora,
corpus linguistics,
corpus generations,
corpus classification
Authors
Solnyshkina Marina I. | Kazan (Volga Region) Federal University | mesoln@yandex.ru
Gatiyatullina Galiya M. | Kazan (Volga Region) Federal University | ggaliya-m@mail.ru