Abundantia Verborum

3. Tutorial

3.2 Classifying the data


3.2.6 Label taxonomies

In Abundantia Verborum, data classification means making implicit information explicit, so that after this phase everything about an observation that we want to take into account in our analysis is represented explicitly and unambiguously; more precisely, it is represented by a label.

In sections 3.2.2 Adding labels via queries, 3.2.3 Adding labels via filters and 3.2.4 Adding labels via zooming we saw which techniques can be used to make explicit any information that has some formal counterpart or trace in the physical data, so that label assignment can be automated. In 3.2.5 Manually adding labels we saw how to add labels ourselves when the computer, so to speak, is not intelligent enough to be taught how to do it.

In this section we discuss a last type of information that can be made explicit. Implicit information can reside not only in the data from the corpus, but also in the labels we have already assigned. In that case it is usually easier to make such information explicit automatically on the basis of the labels already assigned than on the basis of the actual data. Let us take an example from the previous section. If a particular observation has been assigned the label SAID-OF:food, then we humans can infer that "old" is said here of an organic entity (at least for most ingredients), and if SAID-OF:human has been assigned, we can infer that "old" is said of an animate entity (at least at some stage of its existence). There is a structural, as opposed to accidental, implication relation between being food and being mostly organic, and between being human and being animate. Such structural implication rules can be taught to Abundantia Verborum.

If it is not already open, then open the workshop "demowork.wrk", and choose "Workshop | Browse Labels..."! In the Label Browser, click on "Implication Rules"! The Label Hierarchy Table appears. In this environment we can structure our labels into one or several label taxonomies, either partially or completely. Let us take the following taxonomy as an example. It differs a bit from what we find in "demowork.wrk".

                           +---SAID-OF:human
SAID-OF:(animate entity)---|
                           +---SAID-OF:(animal)---SAID-OF:bird

Without wanting to go into matters such as whether in a good taxonomy 'human' belongs under or next to 'animal', we use the above tiny example merely to illustrate the implication rule mechanism. For each "X is a child of Y" relation in a taxonomy you have to create an implication rule X->Y. In such a rule X is called the implying label, and Y is called the implied label. For specifying the above taxonomy you need the following rules, one per branch of the tree:

    SAID-OF:human -> SAID-OF:(animate entity)
    SAID-OF:(animal) -> SAID-OF:(animate entity)
    SAID-OF:bird -> SAID-OF:(animal)

On the basis of this example you should now be able to infer the taxonomy specified in "demowork.wrk" from the implication rules it contains. Both in the above example and in "demowork.wrk" the leaves of taxonomies have names without brackets, and the nodes at a higher level have names enclosed in brackets. This is not obligatory. It is just a mnemonic aid for visually telling high level labels apart from leaf labels.

The philosophy behind using implication rules is that assigning a leaf label to an observation automatically implies that its parent label and all its ancestor labels are assigned too. Given the above rules, assigning SAID-OF:bird means assigning the high level labels SAID-OF:(animal) and SAID-OF:(animate entity) as well. What we gain is that in the analysis phase we can incorporate high level features in our calculations. We will be able to ask the program questions such as: is the distribution of such and such semantic labels similar when "old" is said of animate entities compared to when it is said of inanimate entities?
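The propagation from a leaf label to all its ancestors can be sketched outside the program as follows. This is only an illustration in Python, which is not part of Abundantia Verborum; the names RULES and implied_labels are hypothetical, and the rules are those of the small example taxonomy above, not those stored in "demowork.wrk".

```python
from collections import deque

# Implication rules as (implying label, implied label) pairs,
# one pair per child-parent link of the example taxonomy.
RULES = [
    ("SAID-OF:human", "SAID-OF:(animate entity)"),
    ("SAID-OF:(animal)", "SAID-OF:(animate entity)"),
    ("SAID-OF:bird", "SAID-OF:(animal)"),
]


def implied_labels(label, rules):
    """Return every label implied by `label`, nearest ancestors first."""
    # Index the rules by implying label for quick lookup.
    parents = {}
    for implying, implied in rules:
        parents.setdefault(implying, []).append(implied)

    found = []
    seen = {label}
    queue = deque([label])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:  # guards against duplicates and cycles
                seen.add(parent)
                found.append(parent)
                queue.append(parent)
    return found


print(implied_labels("SAID-OF:bird", RULES))
```

Under these assumptions, the call for SAID-OF:bird yields SAID-OF:(animal) followed by SAID-OF:(animate entity), while a top level label such as SAID-OF:(animate entity) implies nothing further.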

To conclude this section, let us see what implied labels look like in observations. Close all dialog boxes, if any are open, and open the Observation Browser! Double click on the fourth (!) observation in the list of observations! The Observation Editor opens, containing this observation. You see that the list of assigned labels contains the label SAID-OF:food, but if you scroll down you also find the labels SAID-OF:(organic entity), SAID-OF:(physical entity) and SAID-OF:(entity), all three being marked as "<implied>".

Click on the "Graph" button to get a better picture! The Label Inclusion Graph appears. In such a tree all non-leaf nodes (apart from the root) represent implied labels. They are depicted in red (if you have a colour screen). The tree teaches you that the observation is marked as SAID-OF:food, hence SAID-OF:(organic entity), hence SAID-OF:(physical entity), hence SAID-OF:(entity).

Now navigate to the first observation and look at the Label Inclusion Graph of this observation. This is an example of multiple taxonomies. The label SAID-OF:person is specified to be an instantiation of both SAID-OF:(organic entity) and SAID-OF:(animate entity). In this particular case the two categories come together again at a higher level: SAID-OF:(organic entity) implies SAID-OF:(physical entity), and both SAID-OF:(physical entity) and SAID-OF:(animate entity) imply SAID-OF:(entity). Such convergence is not required. It is possible to specify multiple partial taxonomies that do not link up into a bigger structure.
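The distinction between assigned and implied labels, including a label with two parents, can be sketched as follows. This is an illustrative Python sketch, not the program's own mechanism; the name expand is hypothetical, and the rules are the ones mentioned in the text above, which may be only a part of what "demowork.wrk" contains.

```python
# Implication rules as read from the description above (an assumption,
# not an export of "demowork.wrk").
RULES = [
    ("SAID-OF:food", "SAID-OF:(organic entity)"),
    ("SAID-OF:person", "SAID-OF:(organic entity)"),
    ("SAID-OF:person", "SAID-OF:(animate entity)"),
    ("SAID-OF:(organic entity)", "SAID-OF:(physical entity)"),
    ("SAID-OF:(physical entity)", "SAID-OF:(entity)"),
    ("SAID-OF:(animate entity)", "SAID-OF:(entity)"),
]


def expand(assigned, rules):
    """Map each label of an observation to True if it is only implied."""
    parents = {}
    for implying, implied in rules:
        parents.setdefault(implying, set()).add(implied)

    # Labels assigned directly are marked False (not implied).
    result = {label: False for label in assigned}
    stack = list(assigned)
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in result:  # a shared ancestor is added only once
                result[parent] = True
                stack.append(parent)
    return result


for label, implied in sorted(expand(["SAID-OF:person"], RULES).items()):
    print(label, "<implied>" if implied else "")
```

With these rules, SAID-OF:person brings in four implied labels, and SAID-OF:(entity) appears only once even though it is reached along two paths. Rules that never share an ancestor would simply produce separate, unconnected groups of implied labels.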

