
Research Project sem•metrix
sem•metrix kick-off workshop



On 10 January 2007, we celebrated the start of our new research project with the sem•metrix kick-off workshop. Invited researchers from different fields presented their work on semantic associations and the automatic discovery of synonyms. Below you will find the abstracts of their presentations, together with the slides.

The sem·metrix project. Scaling up the profile-based measurement of lexical variation [pdf] [pdf]
Kris Heylen & Yves Peirsman, University of Leuven

In the first part of this talk we present the sem·metrix project, which aims to develop tools for the large-scale measurement of lexical variation. The project takes as its starting point the profile-based study of lexical variation developed by Geeraerts, Grondelaers & Speelman (1999). Within this approach, a profile is defined as a set of near-synonyms and their relative frequencies in a corpus. With several corpora representative of different language varieties, these profiles can then be used to measure lexical variation between language varieties with respect to a given concept, and without thematic bias. The examination of multiple profiles, finally, gives a more general picture of the lexical variation between varieties.
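The profile-based approach described above can be illustrated with a minimal sketch. The function and variable names, the toy Dutch near-synonym pair, and the use of a (halved) city-block distance between relative-frequency distributions are all illustrative assumptions here, not the project's actual implementation; the original uniformity measures of Geeraerts, Grondelaers & Speelman (1999) may differ in detail.

```python
from collections import Counter

def profile(corpus_tokens, near_synonyms):
    """Relative frequencies of a set of near-synonyms in a corpus:
    a 'profile' in the sense of Geeraerts, Grondelaers & Speelman (1999)."""
    counts = Counter(t for t in corpus_tokens if t in near_synonyms)
    total = sum(counts.values())
    return {w: counts[w] / total for w in near_synonyms} if total else {}

def profile_distance(p1, p2):
    """Halved city-block distance between two profiles (one possible
    variation measure; 0 = identical usage, 1 = disjoint usage)."""
    words = set(p1) | set(p2)
    return sum(abs(p1.get(w, 0) - p2.get(w, 0)) for w in words) / 2

# Hypothetical toy data for the concept TROUSERS in two varieties.
belgian = ["broek", "pantalon", "broek", "broek", "pantalon"]
dutch = ["broek", "broek", "broek", "broek", "pantalon"]
syns = {"broek", "pantalon"}
d = profile_distance(profile(belgian, syns), profile(dutch, syns))
```

Averaging such distances over many concepts would then give the more general picture of lexical variation between varieties that the abstract mentions.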

Geeraerts, Grondelaers & Speelman (1999) successfully investigated the lexical variation between Belgian and Netherlandic Dutch in the semantic fields of clothing and football. However, their original approach proved difficult to extend to additional concepts because of the time-consuming manual definition of profiles. To overcome this scalability problem, the sem·metrix project will make use of computational-linguistic tools to automatically identify sets of near-synonyms and their relative frequencies in a corpus.

In the second part of the presentation, we will investigate EuroWordNet as a possible gold standard for the automatic discovery of synonyms. Since the Dutch ontology is known to be less rich than its English counterpart, we will examine how this influences the semantic relatedness figures that can be obtained on the basis of EuroWordNet. This will be done via two evaluation exercises: a direct comparison with human judgements of semantic similarity, and a task-based exercise that uses semantic relatedness values in a Word Sense Disambiguation algorithm.
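The first evaluation exercise mentioned above is typically scored with a rank correlation between system relatedness values and human similarity judgements. As a sketch of that step (the scoring convention is an assumption; the abstract does not name the statistic), here is a self-contained Spearman correlation with tie-aware average ranks:

```python
def ranks(values):
    """Average ranks of a list of values (ties get the mean rank)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank lists."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Toy data: human similarity judgements vs. system relatedness scores
# for three hypothetical word pairs.
rho = spearman([4.0, 3.1, 0.5], [0.9, 0.7, 0.1])
```

A rho near 1 would indicate that the EuroWordNet-based measure ranks word pairs much as human judges do, even if the absolute scales differ.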

The application of dimensionality reduction techniques to distributional data [pdf]
Tim Van de Cruys, University of Groningen

Distributional similarity techniques have proven to be a very useful tool for automatically acquiring semantics from text. In recent years, several dimensionality reduction techniques have emerged that transform the original data into a feature space of reduced dimensionality, of which Latent Semantic Analysis is the most famous example. There are two reasons for performing a dimensionality reduction. The first is that using the original data might lead to intractable computations, so that a reduction of the number of features is necessary. The second is that such techniques may be able to generalize over the data (i.e., capture intrinsic semantic features), so that the data is described in a better way. This presentation will look at dimensionality reduction techniques that are able to achieve both of these goals, mainly in the context of noun clustering. More specifically, random indexing and non-negative matrix factorization will be explored. Random indexing is a technique well suited to countering intractable computations, while non-negative matrix factorization seems able to get a grasp of the 'semantic dimensions' present in a corpus.
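Random indexing, the first of the two techniques named above, sidesteps building a full co-occurrence matrix: every context word gets a fixed sparse random "index vector", and a target word's representation is simply the sum of the index vectors of its neighbours, so the dimensionality stays fixed regardless of vocabulary size. The parameter values and window-based contexts below are illustrative assumptions, not the presenter's actual setup:

```python
import random

def random_index_vectors(vocab, dim=10, nonzero=4, seed=0):
    """Assign each word a sparse random ternary index vector
    (`nonzero` entries set to +1 or -1, the rest 0)."""
    rng = random.Random(seed)
    vecs = {}
    for w in sorted(vocab):
        v = [0] * dim
        for pos in rng.sample(range(dim), nonzero):
            v[pos] = rng.choice((-1, 1))
        vecs[w] = v
    return vecs

def context_vectors(tokens, index_vecs, window=2):
    """Each word's context vector is the sum of the index vectors of
    the words occurring within `window` positions of it."""
    dim = len(next(iter(index_vecs.values())))
    ctx = {w: [0] * dim for w in index_vecs}
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                ctx[w] = [a + b for a, b in zip(ctx[w], index_vecs[tokens[j]])]
    return ctx

tokens = "the cat sat on the mat".split()
idx = random_index_vectors(set(tokens), dim=10)
ctx = context_vectors(tokens, idx)
```

Because the index vectors are nearly orthogonal, the summed context vectors approximately preserve co-occurrence information while keeping the computation tractable, which is exactly the first of the two goals the abstract distinguishes.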

Finding semantically related words using distributional similarity in syntactic contexts [pdf]
Lonneke van der Plas, University of Groningen

'Similar words occur in similar contexts.' That is the underlying idea (Harris 1968) behind trying to find semantically related words by looking at the way they are distributed across syntactic contexts. We have used distributional similarity to find semantically related words, and it is our goal to use this lexico-semantic information for question answering.

In this talk I will describe the data we use: words in grammatical relations extracted from an automatically parsed corpus of roughly 500 million words. I will then give a summary of the possible vector-based methods for calculating the distributional similarity between words.
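One common vector-based method of the kind surveyed here represents each word by counts over its (grammatical relation, context word) features extracted from parses, and compares words by cosine similarity. This is a minimal sketch with hypothetical Dutch toy triples; the talk's actual feature weighting and similarity measures may well differ:

```python
from collections import defaultdict
from math import sqrt

def build_vectors(triples):
    """Co-occurrence vectors from (word, relation, context-word) triples:
    each word is described by counts over its syntactic features."""
    vecs = defaultdict(lambda: defaultdict(int))
    for word, rel, context in triples:
        vecs[word][(rel, context)] += 1
    return vecs

def cosine(v1, v2):
    """Cosine similarity between two sparse feature-count vectors."""
    dot = sum(v1[f] * v2.get(f, 0) for f in v1)
    n1 = sqrt(sum(x * x for x in v1.values()))
    n2 = sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Hypothetical parsed triples: cats and dogs eat and get petted,
# tables get set.
triples = [
    ("kat", "subj_of", "eten"), ("hond", "subj_of", "eten"),
    ("kat", "obj_of", "aaien"), ("hond", "obj_of", "aaien"),
    ("tafel", "obj_of", "dekken"),
]
vecs = build_vectors(triples)
```

On this toy data, "kat" and "hond" share all their syntactic contexts and come out maximally similar, while "kat" and "tafel" share none: exactly the pattern that, at corpus scale, surfaces synonyms but also antonyms and co-hyponyms.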

The semantically related words retrieved by the system are at first sight quite impressive. However, closer inspection reveals a drawback of the method: it is hard to distinguish the different lexical relations among the words retrieved. We find synonyms, but also antonyms, co-hyponyms, hyponyms, etc. I will give some examples using the demo we built and report some evaluation scores. If time permits, I would like to broaden the view to lexical variation: if we use different corpora, do we see different semantically related words appearing?

The syntactic, semantic and network properties of word associations [pdf]
Simon de Deyne, University of Leuven

Word associations have recently seen a revival as a tool to investigate cognitive phenomena in language and memory. The word association space (WAS), a distributional approach similar to LSA and HAL but based on word associations, gave a better account of these phenomena than LSA or WordNet (Steyvers, Shiffrin & Nelson 2004). In addition, statistical network analyses have revealed that word associations adhere to a scale-free, small-world network structure. The regularities of such networks have previously been found in other complex natural networks, such as the World Wide Web.

In this presentation, a new Dutch word association corpus will be introduced. The syntactic distribution of association responses is investigated, together with measures of their utility, in order to identify the determinants that drive the process of generating associations.

Next, the theoretical implications of the small-world properties of the associative network are discussed. For instance, it is interesting to investigate to what extent conventional models of semantic organization, such as high-dimensional vector spaces, can account for the small-world regularities found in association networks.

Within an associative network, a very small set of nodes function as hubs, linked to many other words. To date, the possibility that these hubs correspond to a particular part of speech or semantic relation type has not been explored. In order to investigate the semantic relations that appear in word associations, the cognitively inspired semantic taxonomy of Wu and Barsalou was applied to the word association data.
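The hub notion used above is just node degree in the undirected association graph: hubs are the nodes with the most distinct neighbours. A minimal sketch, with hypothetical cue-response pairs (the actual corpus and any edge-weighting scheme are not shown here):

```python
from collections import Counter

def association_hubs(pairs, top=3):
    """Degree of each node in an undirected graph built from
    cue-response pairs; the highest-degree nodes are the hubs."""
    degree = Counter()
    seen = set()
    for cue, response in pairs:
        edge = frozenset((cue, response))
        if len(edge) == 2 and edge not in seen:  # skip self-loops, duplicates
            seen.add(edge)
            for w in edge:
                degree[w] += 1
    return degree.most_common(top)

# Hypothetical cue-response pairs from an association experiment.
pairs = [("dog", "cat"), ("cat", "mouse"), ("dog", "bone"), ("cat", "milk")]
hubs = association_hubs(pairs, top=1)
```

In a scale-free network, plotting how many nodes have each degree would show a heavy-tailed distribution: most words have few links, while a handful of hubs, the ones whose part of speech and relation types the presentation examines, have very many.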

Apart from answering questions about the network architecture, this semantic analysis of word associations provides insight into the distribution of situational, entity, taxonomic, and introspective properties. The resulting picture from these analyses will be discussed with reference to the LASS theory (Language and Situated Simulation) of conceptual processing (Barsalou, Santos, Simmons & Wilson in press).
K.U.Leuven - CWIS Copyright © Katholieke Universiteit Leuven | Comments on the content: info.genling@arts.kuleuven.ac.be
Production: Yves Peirsman | Most recent update: January 24, 2007
URL: http://wwwling.arts.kuleuven.be/qlvl