Research Project sem•metrix
sem•metrix kick-off workshop
On 10 January 2007, we celebrated the start of our new
research project with the sem•metrix kick-off
workshop. Invited researchers from different fields
presented their work on semantic associations and the
automatic discovery of synonyms. Below you will find the
abstracts of their presentations together with the slides.
Program
The sem·metrix project. Scaling up the profile-based measurement of lexical variation
[pdf] Kris Heylen & Yves Peirsman, University of Leuven
In the first part of this talk we present the sem·metrix
project, which aims to develop tools for the large-scale
measurement of lexical variation. The project takes
as its starting point the profile-based study of lexical
variation developed by Geeraerts, Grondelaers & Speelman
(1999). Within this approach, a profile is defined as
a set of near-synonyms and their relative frequencies
in a corpus. With several corpora representative of
different language varieties, these profiles can then
be used to measure lexical variation between language
varieties with respect to a given concept, and without
thematic bias. The examination of multiple profiles,
finally, gives a more general picture of the lexical
variation between varieties.
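
As a rough illustration of the profile idea described above, the following Python sketch computes a profile (relative frequencies of a set of near-synonyms) for two toy corpora and compares them. The function names, the toy data, and the overlap measure (summing the smaller relative frequency per term) are assumptions made for this example, not necessarily the measure used in the project.

```python
from collections import Counter


def profile(tokens, near_synonyms):
    """Relative frequencies of the given near-synonyms in a list of tokens."""
    counts = Counter(t for t in tokens if t in near_synonyms)
    total = sum(counts.values())
    if total == 0:
        # None of the near-synonyms occurs in this corpus.
        return {term: 0.0 for term in near_synonyms}
    return {term: counts[term] / total for term in near_synonyms}


def profile_overlap(p1, p2):
    """Sum of the smaller relative frequency per term (1.0 = identical profiles)."""
    terms = set(p1) | set(p2)
    return sum(min(p1.get(t, 0.0), p2.get(t, 0.0)) for t in terms)


# Toy data, invented for illustration only.
concept = {"jeans", "spijkerbroek"}
corpus_a = ["jeans"] * 3 + ["spijkerbroek"] * 1 + ["de", "een"]
corpus_b = ["jeans"] * 1 + ["spijkerbroek"] * 3 + ["het"]

profile_a = profile(corpus_a, concept)
profile_b = profile(corpus_b, concept)
overlap = profile_overlap(profile_a, profile_b)
```

Because each profile is normalised within the concept, words outside the near-synonym set do not affect the comparison, which is one way to read "without thematic bias".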
Geeraerts, Grondelaers & Speelman (1999) successfully
investigated the lexical variation between Belgian and
Netherlandic Dutch in the semantic fields of clothing
and football. However, their original approach proved
difficult to extend to additional concepts because of
the time-consuming manual definition of profiles. To
overcome this scalability problem, the sem·metrix project
will make use of computational-linguistic tools to automatically
identify sets of near-synonyms and their relative frequencies
in a corpus.
In the second part of the presentation, we will investigate
EuroWordNet as a possible gold standard for the automatic
discovery of synonyms. Since it is well known that the
Dutch ontology is less rich than its English counterpart,
we will have a look at how this fact influences the
figures of semantic relatedness that can be obtained
on the basis of EuroWordNet. This will be done via two
evaluation exercises: one direct comparison with human
judgements of semantic similarity, and one task-based
exercise that uses values of semantic relatedness in
a Word Sense Disambiguation algorithm.
The application of dimensionality reduction techniques to distributional data
[pdf] Tim Van de Cruys, University of Groningen
Distributional similarity techniques have proven to
be a very useful tool in automatically acquiring semantics
from text. In recent years, several dimensionality reduction
techniques have emerged that are able to transform the
original data into a feature space of reduced dimensionality
(of which Latent Semantic Analysis is the most famous
example). There are two reasons for performing a dimensionality
reduction: the first reason is that using the original
data might lead to intractable computations, so that
a reduction of the number of features is necessary.
The second reason is that the techniques might be able
to generalize among the data (i.e., capturing intrinsic
semantic features), so that the data is described in
a better way. This presentation will look at dimensionality
reduction techniques that are able to achieve both of
these goals, mainly in the context of noun clustering.
More specifically, random indexing and non-negative
matrix factorization will be explored. Random indexing
is a technique perfectly suited to counter intractable
computations, while non-negative matrix factorization
seems to be able to get a grasp of the "semantic dimensions"
present in a corpus.
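
To make the idea behind random indexing concrete, here is a minimal Python sketch under stated assumptions: every context receives a sparse random index vector of fixed length, and a word's reduced vector is the sum of the index vectors of the contexts it occurs with. The function names, parameter values, and toy pairs are hypothetical and are not taken from the talk.

```python
import random
from collections import defaultdict


def index_vector(dim, nonzeros, rng):
    """Sparse ternary random vector: a few +1/-1 entries, all others 0."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1, 1))
    return vec


def random_indexing(pairs, dim=50, nonzeros=4, seed=0):
    """Build reduced word vectors from (word, context) co-occurrence pairs."""
    rng = random.Random(seed)
    context_index = {}
    word_vectors = defaultdict(lambda: [0] * dim)
    for word, context in pairs:
        # Assign each new context its own fixed random index vector.
        if context not in context_index:
            context_index[context] = index_vector(dim, nonzeros, rng)
        vec = word_vectors[word]
        for i, value in enumerate(context_index[context]):
            vec[i] += value
    return dict(word_vectors)


# Toy (noun, context) pairs, invented for illustration only.
pairs = [
    ("apple", "obj_of:eat"), ("apple", "obj_of:peel"),
    ("pear", "obj_of:eat"), ("pear", "obj_of:peel"),
    ("car", "obj_of:drive"),
]
vectors = random_indexing(pairs, dim=50, nonzeros=4, seed=0)
```

The full word-by-context matrix is never built: the dimensionality is fixed in advance by `dim`, however many distinct contexts the corpus contains, which is why the technique suits otherwise intractable computations.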
Finding semantically related words using distributional similarity in syntactic contexts
[pdf] Lonneke van der Plas, University of Groningen
"Similar words occur in similar contexts." That is the
underlying idea (Harris 1968) of trying to find semantically
related words by looking at the way they are distributed
in syntactic data. We have used distributional similarity
to find semantically related words and it is our goal
to use this lexico-semantic information for question
answering.
In this talk I will describe the data we use: words
in grammatical relations extracted from an automatically
parsed corpus of roughly 500 million words. I will then
give a summary of the possible vector-based methods
for calculating the distributional similarity between
words.
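
One common vector-based method can be sketched as follows: each word is represented by counts of the grammatical relations it takes part in, and two words are compared with cosine similarity. This is only an illustrative sketch; the triples are invented, the names are hypothetical, and raw counts are used here although systems of this kind often apply a weighting scheme first.

```python
import math
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(u[f] * v[f] for f in u if f in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy (word, relation, head) triples, invented for illustration only.
triples = [
    ("coffee", "obj_of", "drink"), ("coffee", "obj_of", "drink"),
    ("coffee", "mod", "hot"),
    ("tea", "obj_of", "drink"), ("tea", "mod", "hot"),
    ("car", "obj_of", "drive"),
]

# One sparse vector per word; each feature is a (relation, head) pair.
context_vectors = defaultdict(Counter)
for word, relation, head in triples:
    context_vectors[word][(relation, head)] += 1

sim_tea = cosine(context_vectors["coffee"], context_vectors["tea"])
sim_car = cosine(context_vectors["coffee"], context_vectors["car"])
```

In this toy example "coffee" comes out closer to "tea" than to "car" because the first two share syntactic contexts, while the last pair shares none.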
The semantically related words retrieved by the system
are at first sight quite impressive. However, a closer
inspection reveals a drawback of the method: it
is hard to distinguish between the different lexical relations among
the words retrieved by the system. We find synonyms
but also antonyms, co-hyponyms, hyponyms, etc. I will
give some examples using the demo we made and I will
give some scores from evaluation. If time permits, I
would like to broaden the view to lexical variation.
If we use different corpora, do we see different semantically
related words appearing?
The syntactic, semantic and network properties of word associations
[pdf] Simon de Deyne, University of Leuven
Word associations have recently seen a revival as a
tool to investigate cognitive phenomena in language
and memory. The word association space (WAS), a distributional
approach similar to LSA and HAL but based on word association data,
gave a better account of these phenomena than
LSA or WordNet (Steyvers, Shiffrin & Nelson 2004). In
addition, statistical network analyses have revealed
that word associations adhere to a scale-free, small-world
network structure. The regularities of such networks
have previously been found in other complex natural
networks such as the World Wide Web.
In this presentation, a new Dutch word association corpus
will be introduced. The syntactic distribution of association
responses is examined together with measures
of their utility in order to identify
the determinants that drive the process of generating
associations.
Next, the theoretical implications of small-world properties
of the associative network are discussed. For instance,
it is interesting to investigate to what extent conventional
models of semantic organization, such as high-dimensional
vector spaces, can account for the small-world regularities
found in association networks.
Within an associative network, a very small set of nodes
functions as hubs, receiving many links from other words.
To date, the possibility that these hubs correspond
to a certain part of speech or a semantic relation type
has not been explored. In order to investigate the semantic
relations that appear in word associations, the cognitively
inspired semantic taxonomy by Wu and Barsalou was applied
to the word association data.
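
The notion of a hub can be illustrated with a small Python sketch that counts, for each response word, how many distinct cue words elicit it (its in-degree) and ranks words accordingly. The cue-response pairs and function names are invented for this example and do not come from the Dutch association corpus.

```python
from collections import defaultdict


def in_degrees(associations):
    """Number of distinct cues that elicit each response word."""
    incoming = defaultdict(set)
    for cue, response in associations:
        incoming[response].add(cue)
    return {word: len(cues) for word, cues in incoming.items()}


def top_hubs(associations, n=1):
    """The n words with the highest in-degree (ties broken alphabetically)."""
    degrees = in_degrees(associations)
    return sorted(degrees, key=lambda w: (-degrees[w], w))[:n]


# Toy (cue, response) pairs, invented for illustration only.
associations = [
    ("dog", "animal"), ("cat", "animal"), ("horse", "animal"),
    ("dog", "bark"), ("cat", "milk"), ("milk", "white"),
]

degrees = in_degrees(associations)
hubs = top_hubs(associations, n=1)
```

With part-of-speech or relation-type labels attached to each word, the same ranking could be used to check whether hubs cluster in a particular category.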
Apart from answering questions about the network architecture,
this semantic analysis of word associations provides
insight into the distribution of situational, entity,
taxonomic and introspective properties.
The picture that emerges from these analyses will be discussed
with reference to the LASS theory (Language and Situated
Simulation) of conceptual processing (Barsalou, Santos,
Simmons & Wilson in press).