Abundantia Verborum
4. Virtual corpora and corpus linguistics
4.4 Summary
We conclude the chapter with an enumeration
of the most important features of Abundantia Verborum
with respect to searching corpora.
- Abundantia Verborum requires corpora to be
collections of plain text files, possibly enriched with
markup. The orientation towards this format is motivated
by common practice and by
the argument that any use of proprietary formats
will eventually severely limit the accessibility
of data.
- The virtual corpus mechanism, and the embedded
virtual corpus view mechanism, support the
recognition of several markup styles, including
TEI-compliant SGML. The choice to support many
different styles was motivated by the heterogeneity
of currently available data.
- The corpus preparation tool was built to keep
the design of the virtual corpus mechanism simple
and to avoid time-intensive computation when
that mechanism is used. Moreover, it
is a useful general-purpose tool.
- The query language of the program is based on
regular expressions and word-based searches. These components
were chosen because they are widely known and because, in combination
with the virtual corpus mechanism, they form a solid basis
for general and linguistic queries, respectively.
Because the query language is typed, it can easily be extended.
- Abundantia Verborum corpus exploration uses neither
indices nor data encryption. This is because Abundantia Verborum
is typically used for the linguistic exploration of locally available
corpora. Such corpora typically change quickly
and are freely accessible. Moreover, linguistic exploration
typically involves sophisticated ad hoc questions rather
than frequently repeated, stereotypical queries.
This orientation, however, is not a fundamental
limitation. Both the use of indices and the use of data
encryption/decryption could be built in.
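
The combination of plain text files, regular expressions, word-based searches and index-free exploration can be illustrated with a short sketch. It is not Abundantia Verborum's implementation or query language. It is a minimal Python approximation that assumes a corpus is a directory of UTF-8 `.txt` files and that markup can be reduced to simple angle-bracket tags. The names `search_corpus`, `strip_markup` and `TAG` are hypothetical.

```python
import re
from pathlib import Path
from typing import Iterator, Tuple

# Assumption: markup consists of simple SGML-style tags such as <p> or </s>.
TAG = re.compile(r"<[^>]+>")


def strip_markup(line: str) -> str:
    """Remove angle-bracket tags so that queries see only the text."""
    return TAG.sub("", line)


def search_corpus(
    root: Path, pattern: str, whole_word: bool = False
) -> Iterator[Tuple[str, int, str]]:
    """Scan every .txt file under root line by line, without any index.

    Yields (file name, line number, text without markup) for each line
    that matches the regular expression. With whole_word=True the
    pattern must match between word boundaries, which approximates a
    word-based search.
    """
    if whole_word:
        pattern = r"\b(?:" + pattern + r")\b"
    rx = re.compile(pattern)
    for path in sorted(root.rglob("*.txt")):
        with path.open(encoding="utf-8") as fh:
            for lineno, line in enumerate(fh, start=1):
                text = strip_markup(line.rstrip("\n"))
                if rx.search(text):
                    yield (path.name, lineno, text)
```

Because every query rereads the files, changes to the corpus are visible immediately, which suits fast-changing local corpora. The price is that each query takes time proportional to the size of the corpus, and this is the cost an index would remove.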