English Linguistics:
Applied Linguistics

K.U.Leuven
------

Google it!

The internet as corpus?

While several definitions of what a (linguistic) 'corpus' is are in circulation, they all agree that a corpus is a collection of naturally occurring texts that has been deliberately compiled on the basis of certain criteria (e.g. dialectal, historical, and genre variation) with a view to performing linguistic analysis on it. Corpora are usually also enriched in various ways: information such as the source of the text, for spoken data perhaps the sociological background of the speakers who have been transcribed, and part-of-speech tags has been added to the corpus.

The idea that the internet is a giant corpus that is freely available to everyone may seem attractive at first blush, but it is important to realize that many of the features listed above are missing in 'the internet-as-corpus': while one could view it as a collection of naturally occurring texts, these texts have not been deliberately collected on the basis of certain criteria and with certain goals in mind, nor have they been enriched. This entails a number of important drawbacks for trying to use the internet as a corpus:

  • because the internet is (obviously) not tagged for parts of speech, you can only search on word forms, not on parts of speech; in addition, at least in Google you cannot use wildcards so as to cover several word forms in one go (e.g. "aggrav*"); this means that you can only search on words and exact phrases
  • because the internet does not consist of "subcorpora" representing different genres (such as newspaper language, narrative, radio broadcasts, conversation), it is not possible to compare results across different genres -- note that some of these genres (but not all, e.g. you will not find a lot of transcribed naturally occurring conversation) are of course available, but there is no way of limiting your search to those genres (except by manually limiting your search to one website, which is highly impracticable)
  • because there is no way of measuring the size of the internet or of parts of it, you can never obtain reliable quantitative data from internet searches; for instance, you cannot calculate relative frequencies, since you have no idea how many sites containing how many words have been searched, nor can you reliably compare frequencies for a given construction in, say, Belgian Dutch and Netherlandic Dutch (e.g. "zoiets van" site:.nl yields about 8150 hits and "zoiets van" site:.be about 893, but as long as you do not know how many words the .nl and .be sites that were searched contain, you cannot safely draw any conclusions from this)
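
To make the last point concrete, here is a minimal sketch that normalises the two counts quoted above to a rate per million words. The corpus sizes in it are invented purely for illustration (the real sizes are unknown, which is exactly the problem), and per_million is a hypothetical helper name.

```python
def per_million(hits, corpus_size_in_words):
    """Return the relative frequency of a construction per million words."""
    if corpus_size_in_words <= 0:
        raise ValueError("corpus size must be positive")
    return hits / corpus_size_in_words * 1_000_000


# Raw counts quoted in the text above.
hits_nl = 8150  # "zoiets van" site:.nl
hits_be = 893   # "zoiets van" site:.be

# HYPOTHETICAL sizes in words, invented for illustration only.
scenario_a = {"nl": 900_000_000, "be": 100_000_000}
scenario_b = {"nl": 900_000_000, "be": 50_000_000}

for name, sizes in (("A", scenario_a), ("B", scenario_b)):
    rate_nl = per_million(hits_nl, sizes["nl"])
    rate_be = per_million(hits_be, sizes["be"])
    print(f"Scenario {name}: .nl {rate_nl:.2f} per million, .be {rate_be:.2f} per million")
```

Under the first invented scenario the two rates come out nearly equal; under the second the construction is about twice as frequent on .be sites as on .nl sites. Since neither scenario can be verified, the raw counts alone support no conclusion.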

There is, however, one important advantage of using the internet as corpus: material is constantly being added to it, which means that (even though it contains many outdated pages, broken links, etc.) it stands a good chance of quickly reflecting new language developments that have not yet reached the corpora, and of yielding more results for fairly recent developments for which corpora throw up only a few results. Consider these two examples:

A Cobuild corpus search on "mister+so+called" yields no results; one on "mr+so+called" yields only two tokens (Mr so-called Jagger, Mr so-called Graham). Using Google, it becomes possible to find more examples of this constructional template and to uncover the different functions so-called has acquired in it:

What do you call a person who keeps bad things away? A "bad thing keeper awayer", according to Google (whereas Cobuild has no occurrences for "keeper+awayer"):

(See Bert Cappelle (2003) Meervoudig -er bij Engelse partikelwerkwoorden. Morfologiedagen Gent.)

Search procedures

In order to search for occurrences of a word on the internet, turn to www.google.be and type in the word. If you want to search for an exact phrase (for instance "mr so called"), make sure to put quotation marks around the entire phrase. If you search on mr so called rather than on "mr so called", you will get all pages on which mr, so, and called occur, in any order, for instance The so-called "can spam" bill was among several Mr. Bush was signing during the day. In other words, you generate a lot of noise that you then have to filter out by hand. Let Google do this work for you by using the quotation marks!
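
The difference between the two kinds of search can be sketched as follows. This is a simplified illustration, not a description of how Google actually matches pages: the function names are hypothetical, and treating the hyphen in so-called as a word separator is an assumption.

```python
import re


def tokens(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_all_words(query, page):
    """Unquoted search: every query word occurs somewhere on the page."""
    page_tokens = set(tokens(page))
    return all(word in page_tokens for word in tokens(query))


def matches_exact_phrase(query, page):
    """Quoted search: the query words occur adjacent to each other and in order."""
    q, p = tokens(query), tokens(page)
    return any(p[i:i + len(q)] == q for i in range(len(p) - len(q) + 1))


# The noisy example from the text and one of the Cobuild tokens.
noise = 'The so-called "can spam" bill was among several Mr. Bush was signing during the day.'
hit = "Mr so-called Jagger"

for page in (noise, hit):
    print(matches_all_words("mr so called", page),
          matches_exact_phrase("mr so called", page))
```

The noisy sentence satisfies the unquoted search but not the quoted one, whereas Mr so-called Jagger satisfies both.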

If you want to restrict your search to sites from a certain country (.uk, .nl, .be) or from a certain domain, for instance, you can add site:.domain after your search term, as in "zoiets van" site:.nl or "zoiets van" site:.kuleuven.ac.be. Alternatively, you can do this via Google's "Advanced search", where you can also use a few other search options.
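
If you need to run many such searches, the query strings can be put together programmatically. The sketch below only builds the query text and a search address; it does not contact Google. The function names are hypothetical, and the /search path with a q parameter is an assumption about Google's address format rather than something stated on this page.

```python
from urllib.parse import urlencode


def build_query(phrase, site=None):
    """Wrap the phrase in quotation marks and optionally add a site: restriction."""
    query = f'"{phrase}"'
    if site:
        query += f" site:{site}"
    return query


def search_url(query, base="https://www.google.be/search"):
    """Return a search address for the query (assumed address format)."""
    return f"{base}?{urlencode({'q': query})}"


for site in (".nl", ".be", ".kuleuven.ac.be"):
    query = build_query("zoiets van", site)
    print(query)
    print(search_url(query))
```

Keeping the quoting and the site: restriction in one small function ensures that every search in a comparison is phrased in exactly the same way.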

------
K.U.Leuven - CWIS

Copyright ©2002-2005 Katholieke Universiteit Leuven

Contents: Lieven Vandelanotte

Created by: Lieven Vandelanotte

Last modified: 08-03-2005

URL: http://wwwling.arts.kuleuven.ac.be/engling/appling