English Linguistics:
Google it! The internet as corpus?

While several definitions of what a (linguistic) 'corpus' is are around (consider these few, for example), they all agree that a corpus is a collection of naturally occurring texts that has been deliberately collected on the basis of certain criteria (e.g. dialectal, historical, and generic variation) with a view to performing linguistic analysis on it. Usually corpora are also enriched in various ways: information such as the source of the text (for written data), the sociological background of the speakers who have been transcribed (for spoken data), and part-of-speech tagging has been added to the corpus.

The idea that the internet is a giant corpus that is freely available to everyone may seem attractive at first blush, but it is important to realize that many of the features listed above are missing in 'the internet-as-corpus': while one could view it as a collection of naturally occurring texts, these texts have not been deliberately collected on the basis of certain criteria and with certain goals in mind, nor have they been enriched. This entails a number of important drawbacks for trying to use the internet as a corpus:
There is, however, one important advantage of using the internet as corpus: material is constantly being added to it. Even though it contains many outdated pages, broken links, etc., it therefore stands a good chance of quickly reflecting new language developments which have not yet reached the corpora, and of yielding more results for fairly recent developments for which corpora throw up only a few results. Consider these two examples:

A Cobuild corpus search on "mister+so+called" yields no results; one on "mr+so+called" yields only two tokens (Mr so-called Jagger, Mr so-called Graham). Using Google, it becomes possible to find more examples of this constructional template and to uncover the different functions so-called has acquired in it:
What do you call a person who keeps bad things away? A "bad thing keeper awayer", according to Google (whereas Cobuild has no occurrences for "keeper+awayer"):
(See Bert Cappelle (2003) Meervoudig -er bij Engelse partikelwerkwoorden. Morfologiedagen Gent.)

Search procedures

In order to search for occurrences of a word on the internet, go to www.google.be and type in the word. If you want to search for an exact phrase (for instance "mr so called"), make sure to put quotation marks around the entire phrase. If you search on mr so called rather than on "mr so called", you will get all pages in which mr, so, and called occur, in any order, for instance The so-called "can spam" bill was among several Mr. Bush was signing during the day. In other words, you generate a lot of noise that you have to filter out. Let Google do this work for you!

If you want to restrict your search to sites from a certain country (.uk, .nl, .be) or from a certain domain, you can add site:.domain after your search term, as in "zoiets van" site:.nl or "zoiets van" site:.kuleuven.ac.be. Alternatively, you can do this via Google's "Advanced search", where you can also use a few other search options.
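The search procedures above can be tried out in a short Python sketch. It is only an illustration, not part of the course material: the function names are hypothetical, the search URL layout (a q parameter on www.google.be/search) is an assumption, and the two matching functions are a toy model of the difference between an unquoted and a quoted search, not a description of how Google works internally.

```python
import re
from urllib.parse import urlencode


def build_query(terms, exact=True, site=None):
    """Compose a search string (hypothetical helper, for illustration)."""
    # Quotation marks ask for the words as one exact phrase.
    query = f'"{terms}"' if exact else terms
    # site: restricts results to a country code or domain, e.g. ".nl".
    if site:
        query += f" site:{site}"
    return query


def search_url(query, base="https://www.google.be/search"):
    """Turn a search string into a URL; the base address is an assumption."""
    return f"{base}?{urlencode({'q': query})}"


def words(text):
    """Lower-case a text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def matches_all_words(text, terms):
    """Toy model of an unquoted search: every word occurs, in any order."""
    page = set(words(text))
    return all(term in page for term in words(terms))


def matches_phrase(text, terms):
    """Toy model of a quoted search: the words occur next to each other."""
    page, target = words(text), words(terms)
    n = len(target)
    return any(page[i:i + n] == target for i in range(len(page) - n + 1))


# The noisy hit quoted in the text above.
NOISE = ('The so-called "can spam" bill was among several '
         'Mr. Bush was signing during the day.')

if __name__ == "__main__":
    print(search_url(build_query("zoiets van", site=".nl")))
    print(matches_all_words(NOISE, "mr so called"))          # noise is matched
    print(matches_phrase(NOISE, "mr so called"))             # noise is filtered
    print(matches_phrase("Mr so-called Jagger", "mr so called"))
```

Under this toy model the Bush sentence counts as a hit for the unquoted search but not for the quoted one, while Mr so-called Jagger is a hit for both, which is the filtering effect that the quotation marks are meant to achieve.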
Copyright ©2002-2005 Katholieke Universiteit Leuven
Contents: Lieven Vandelanotte
Created by: Lieven Vandelanotte
Last modified: 08-03-2005
URL: http://wwwling.arts.kuleuven.ac.be/engling/appling