Abundantia Verborum

3. Tutorial

3.3 Displaying statistics


3.3.2 Venn, Hasse and Schematic diagrams

The two most fundamental diagram types used by Abundantia Verborum, Venn and Hasse diagrams, serve essentially the same purpose in the program. The Venn diagram is the more familiar of the two, but it has some drawbacks. The Hasse diagram provides an alternative that does better on these points. The third diagram type, the Schematic diagram, provides a more linguistics-oriented perspective on the data.

Venn diagrams

Venn diagrams need no elaborate introduction. For the last few decades they have been the standard way of graphically representing sets and set membership relations. Ellipse-shaped figures represent sets. They function like containers. Although not always actually depicted, the members of a set are taken to be located within the boundaries of the ellipse. Non-members are located outside the ellipse. If depicted, elements are often represented as dots. The overlap of two or more ellipses represents the intersection of these sets. The members of this intersection are located within this overlapping region. Non-members are outside this region.

In Abundantia Verborum a Venn-diagram representation of a set always has the name of a label, and it represents the set of all observations in a workshop that have been assigned this label. If the workshop is filtered, in other words, if there is a filter active, then only the observations that are matched by the filter are taken to belong to the workshop at that moment. The filter mechanism was introduced in section 3.2.3 Adding labels via filters. Its relation to diagrams will be the topic of section 3.3.5 Filtered diagrams. The labels assigned to an observation include both the labels that were added explicitly and the labels that were added implicitly through implication rules. For implied labels, see section 3.2.6 Label taxonomies. Using implied labels in diagrams is discussed in section 3.3.4 Diagrams and implied labels.

Creating a diagram

Make sure the workshop "demowork.wrk" is open and active on the Abundantia Verborum desktop and make sure its filter is empty! Then choose "Workshop | Set Graph..."! The Graph Settings dialog box appears. Set the graph type to "Venn diagram"! In order to do so first click on the box with the little triangle that points downwards and then click on the item of your choice. Next make sure that "Disable Threshold" is checked! Finally you have to set the most crucial part of the graph settings, namely the displayed labels. Whatever settings you choose for graph type and for display threshold, as long as you have not selected any displayed labels, there is no diagram.

The displayed labels are the building blocks of your diagram, the criteria for making the piles. Click on "Add Group"! In the Select Label Group dialog box select COMPAR and click "OK"! Back in the Graph Settings you see that the labels of the group COMPAR are now the current displayed labels. Click "OK" to apply the new graph settings!

The workshop window now looks quite different from how it looked before. Three of the four panels in the window have changed. You might want to maximize the window to obtain an optimal overview. The "Display threshold" panel, the middle one on the left, signals that the display threshold is off. For the moment, we leave this information aside. The "Displayed labels" panel, the bottom panel on the left, lists the displayed labels, and assigns a number to each. These numbers reappear in the actual diagram in the "Graph" panel on the right. In this diagram the ellipses, or rather circles, are named 1, 2 and 3. Their full names can be looked up in the "Displayed labels" panel. Instead of using dots for representing elements, gray scales are used to represent the 'population' of regions. Dark regions contain many observations, light regions few.

What we can read on the diagram is that the largest group of observations has the label COMPAR:POS and that a second, smaller group has the label COMPAR:COMP. Further we see, on the basis of the white regions, that no observations contain the label COMPAR:SUP, that no observations contain more than one of the displayed labels, and that there are no observations that contain none of the displayed labels. These facts, of course, do not come as a surprise. They're just a first exercise in interpreting diagrams.

Let us create another diagram. Click, with the left mouse button, anywhere in either the "Display threshold" panel or the "Displayed labels" panel! You'll see that the Graph Settings dialog box appears. Indeed, clicking on either one of these panels is a shortcut for "Workshop | Set Graph...", just as clicking on the "Workshop filter" panel is a shortcut for "Workshop | Set Filter...". In the Graph Settings dialog box remove the label COMPAR:COMP from the list of displayed labels! You do this by clicking on it to select it, and then clicking on "Delete selection". After you have done this, click "OK". The new diagram displays the distribution of the observations over the sets COMPAR:POS and COMPAR:SUP. In contrast to the first diagram, the region outside both sets is now non-empty. You would expect this region to be gray, but instead of making everything outside the sets gray (which would hide the numbers of the sets), the program makes only a small triangle in the upper left-hand corner gray. So this is an important little corner of the picture.

At the beginning of this section we spoke of a drawback of Venn diagrams. This drawback shows up when more than three sets are displayed. Open the Graph Settings dialog box again and remove all displayed labels with "Clear All"! Now set the labels listed below with "Set Labels"! In the Select Labels dialog box you first have to select the correct group before you can select a label. Selecting and deselecting a label is done by clicking on it. Selected labels have a plus sign before their name (cf. the Set Observation's Labels dialog box in 3.2.5 Manually adding labels). The order in which you select the labels determines the order in which they will appear in the list of displayed labels.

When you have done this, click "OK" in the Graph Settings dialog box to create the new diagram! At first sight there is nothing wrong with the diagram that appears, but if you look more carefully, you notice that two regions are missing. There is no separate region for observations that do have labels 1 and 3, and do not have labels 2 and 4. Neither is there a region for observations that do have labels 2 and 4, and do not have labels 1 and 3. There are only 14 regions, while there are 16 different possible configurations of having and not having four labels. The problem increases when there are more sets. For five sets 10 out of 32 possible regions are missing. For six sets 32 out of 64 regions are missing. Even if the program tried out more irregular set shapes, it would still be faced with the same fundamental problem. The problem is inherent to the diagram type: it is not made for displaying lots of information at the same time. Note finally that the program warns you about the problem. Whenever a Venn diagram is incomplete, the title of the "Graph" panel reads "Graph (incomplete):". Whenever it is complete, the title reads "Graph (complete):".
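The figures quoted above can be checked with a little arithmetic. The sketch below assumes that the sets are drawn as circles, as in the diagrams of this section, and uses the known geometric fact that n circles divide the plane into at most n*n - n + 2 regions (the outside region included). The function names are ours and are not part of Abundantia Verborum; how the program actually lays out its diagrams is not described here.

```python
def max_circle_regions(n: int) -> int:
    """Largest number of regions n circles can cut the plane into (outside included)."""
    if n == 0:
        return 1
    return n * n - n + 2


def configurations(n: int) -> int:
    """Number of ways of having or not having each of n labels."""
    return 2 ** n


def missing_regions(n: int) -> int:
    """Configurations for which a diagram of n circles has no region."""
    return max(0, configurations(n) - max_circle_regions(n))


for n in range(1, 7):
    print(n, "sets:", configurations(n), "configurations,",
          missing_regions(n), "without a region")
```

Up to three sets nothing is missing. From four sets onwards the sketch gives 2, 10 and 32 missing regions, the same numbers as in the text.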

Hasse diagrams

Hasse diagrams are used in mathematics to represent, among other things, lattices and Boolean algebras. They also show up in other sciences. For instance, they are used to represent crystal structures. In Abundantia Verborum they are used to chart label configurations. They turn out to be a valuable alternative to Venn diagrams. Before we look at them, first some theory.

Hasse diagrams are graphs, consisting of nodes, represented as circles, and links, represented as lines between the circles. In Abundantia Verborum the nodes of a Hasse diagram represent, and carry the name of, the different subsets of the set of displayed labels. There are as many nodes as there are different subsets. Suppose the set of displayed labels contained the following labels:

Then there would be sixteen nodes in a Hasse diagram, namely:

Remember that whatever set you take, the set itself and the empty set are always taken to be subsets of this set. In our example these 'marginal' cases are the first and the last in the list, respectively. The position of the nodes and the links between the nodes are such that a link between A and B, in combination with the fact that A is depicted higher than B, indicates that A represents a superset of B. Redundant links are not depicted. If A represents a superset of B, and B represents a superset of C, then a direct link between A and C is not depicted. The relation can be inferred from the links between A and B and between B and C.
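The construction described above can be sketched in a few lines of Python. This is only an illustration: the numbers 1 to 4 are placeholders for four displayed labels, and the function names are ours, not the program's. A link is kept only between a set and a subset that has exactly one label fewer, which is what leaving out the redundant links amounts to when all subsets are present.

```python
from itertools import combinations


def subsets(labels):
    """All subsets of the displayed labels, from the full set down to the empty set."""
    items = sorted(labels)
    result = []
    for size in range(len(items), -1, -1):
        for combo in combinations(items, size):
            result.append(frozenset(combo))
    return result


def hasse_links(nodes):
    """Pairs (upper, lower) where upper is a superset of lower with one extra label."""
    return [(a, b) for a in nodes for b in nodes
            if b < a and len(a) - len(b) == 1]


labels = {1, 2, 3, 4}          # placeholder numbers for four displayed labels
nodes = subsets(labels)
links = hasse_links(nodes)

print(len(nodes))              # 16 nodes
print(sorted(nodes[0]), sorted(nodes[-1]))   # the set itself first, the empty set last
print(len(links))              # 32 links
```

A link between {1, 2, 3} and {1} is not produced, because it can be inferred from the links that pass through {1, 2} or {1, 3}.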

In Abundantia Verborum the nodes of a Hasse diagram are the counterparts of the regions in the Abundantia Verborum Venn diagrams. Just like these regions, nodes too are thought of as containing observations. More precisely, a node contains those observations of a (filtered) workshop that have all the labels in the node's name and that do not have any displayed labels that are not in the node's name. As in Venn diagrams, the 'population' of a node is represented by a gray scale, dark being crowded. We conclude the 'theory' with the remark that what was said for Venn diagrams about filtered out observations and about implied labels also applies to Hasse diagrams, which means that the former are not, and the latter are, taken into account when calculating frequencies.
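The rule for which node an observation belongs to can be sketched as follows. The sketch assumes that each observation is given as the set of all its labels (explicit and implied) and that filtered out observations have already been removed. The three observations are invented for the illustration, and the function name is ours, not the program's.

```python
from collections import Counter


def node_populations(observations, displayed):
    """Count observations per node; a node is a subset of the displayed labels.

    An observation lands in the node named by exactly those displayed
    labels that it has. Labels that are not displayed are ignored.
    """
    displayed = frozenset(displayed)
    counts = Counter()
    for labels in observations:
        counts[frozenset(labels) & displayed] += 1
    return counts


displayed = {"COMPAR:POS", "COMPAR:COMP", "COMPAR:SUP"}
observations = [                      # invented sample data
    {"COMPAR:POS", "SEM:having old age"},
    {"COMPAR:POS"},
    {"COMPAR:COMP"},
]

counts = node_populations(observations, displayed)
for node, count in counts.items():
    print(sorted(node), count)
```

Because every observation has exactly one configuration of displayed labels, it is counted in exactly one node. The same counts are what the regions of a complete Venn diagram represent.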

Back to practice. Open the Graph Settings dialog box, clear the list of displayed labels, and once again add all labels of the COMPAR group with "Add Group"! As you start using the program for your own work you'll notice that you typically will want many if not all labels from the same label group in a diagram. Therefore the button "Add Group" is often a convenient tool. This being said, nothing prevents you from mixing labels from different groups in the same diagram. But such diagrams are likely to be more difficult to interpret.

Select the graph type "Hasse diagram" and click "OK" to activate the new diagram! This diagram is the counterpart of the first Venn diagram we tried out above. Notice the similarity of the diagram with the icon of Abundantia Verborum. For a moment think of what is depicted as a three-dimensional cubical construction with little spheres attached to its corners. Let us call the direction going from node {} to node {1} the height of the object, the direction going from node {} to node {2} the depth of the object and the direction going from node {} to node {3} the width of the object.

In section 3.2.1 Using labels we introduced the following metaphor: "You can think of a workshop as an n-dimensional space, n being the summed cardinality of all label groups. In each dimension there are two possible locations, namely point 0 for 'label absent' and point 1 for 'label present'. The observations are spread over this n-dimensional space in such a way that their location corresponds to which labels they do and which they do not have." Hasse diagrams can be looked at in a similar way, with the difference that in the diagrams only a few dimensions are displayed, so that the information becomes representable. In our example height is the dimension of COMPAR:POS. Being in some corner of the ceiling of the construction implies having the label COMPAR:POS. Being somewhere on the floor implies not having this label. In a similar way depth is the dimension of COMPAR:COMP and width is the dimension of COMPAR:SUP.

At first it will seem counter-intuitive to map COMPAR information onto a compound scale consisting of three component binary scales, rather than onto one scale with three positions POS, COMP and SUP. The reason for mapping information related to a group onto a compound scale that bundles the component scales of the individual labels of the group is that it yields one general type of representation, applicable to all sorts of groups or other label sets. For instance, it allows for observations to have compound values, or to have no value at all for a particular group. Compound values are often interesting for linguistic information. We already have used them in the SAID-OF and the SEM groups.
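The cube metaphor can be made concrete with a small sketch. It maps the labels of an observation to a position (height, depth, width), with 0 for 'label absent' and 1 for 'label present'. The function name and the sample label sets are ours, chosen only for the illustration.

```python
def cube_position(labels, displayed_order):
    """Return one 0/1 coordinate per displayed label, in display order."""
    return tuple(1 if label in labels else 0 for label in displayed_order)


# Height, depth and width, in the order used in the text.
order = ["COMPAR:POS", "COMPAR:COMP", "COMPAR:SUP"]

print(cube_position({"COMPAR:POS"}, order))                  # a ceiling corner
print(cube_position({"COMPAR:COMP"}, order))                 # a floor corner
print(cube_position({"COMPAR:POS", "COMPAR:COMP"}, order))   # a compound value
print(cube_position(set(), order))                           # no value for the group
```

The last two calls show what a single three-position scale could not express: an observation with two labels of the group at once, and an observation with none.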

Schematic diagrams

Now click on the button in the speed bar that looks like a small tree diagram! This button is the speed bar equivalent of "Workshop | Set Graph...". We introduce the speed bar button now because the representation in its icon is a midget version of the type of diagram we treat next, the Schematic diagram. In the Graph Settings dialog box clear all current displayed labels and then add the group SEM! Next select "Horizontal Schematic Diagram" as graph type! Then make sure the "Display Threshold" is enabled! We're aware of the fact that display thresholds have not been explained yet. Please be patient. We'll refer back to this passage in 3.3.3 The display threshold. Finally click "OK"!

A tree-like diagram appears. You can think of this diagram as a rudimentary attempt by the machine to divide the observations in the workshop into different readings of "old", on the basis of their SEM labels. You could call it a first proposal for a dictionary entry structure for the lemma "old". According to the diagram there are six basic readings. They are the sons of the root of the tree, namely:

The diagram just shows numbers, but as with the other diagram types you can look up in the "Displayed labels" panel which labels the numbers stand for. There you read that the readings are:

Further towards the leaves of the tree the subdivisions of the readings are given. The first reading, node "1", has one son that represents a more specific sub-case: "1,5". The name of this node can be paraphrased as AND(SEM:having old age, SEM:turned bad). Paraphrasing with filter syntax is appropriate here, because in a sense the nodes in a Schematic diagram function as an additional filter, on top of the one (if any) specified in the "Workshop filter" panel. You can think of node "1" as a container of all observations in the (filtered) workshop that are matched by the additional filter SEM:having old age, and likewise you can think of node "1,5" as a container of all observations in the (filtered) workshop that are matched by the additional filter AND(SEM:having old age, SEM:turned bad).

The second reading has no further subdivisions. The third one has two, namely AND(SEM:not most recent type, SEM:outdated) and AND(SEM:not most recent type, SEM:no longer existing). The latter can also be seen as a subtype of reading D, SEM:no longer existing. This is indicated by the red link. The link is red to indicate that it violates the tree nature of the graph. In a proper tree a node cannot be the son of more than one father. The presence of the red links in the example indicates that the distribution of SEM labels in the workshop does not reflect a purely classical, i.e. hierarchical, semantic structure. The readings overlap. To finish the description of the example: reading D has two subdivisions; reading E has none; reading F, finally, has one, which is also a subclass of reading D.

Note that the fact that the diagram depicts this subclass in the subtree of reading D, and only links it with an oblique red link to reading F, should not be interpreted as an indication of preferred classification. It is an arbitrary choice by the program, and it could just as well have been the other way round. The same is true for all red links in Schematic diagrams.

So much for the informal presentation of the diagram type. How is the diagram constructed? Any Schematic diagram has at least one node, namely the root node, which is always depicted. The maximum number of nodes that a Schematic diagram can have is equal to the number of nodes the corresponding Hasse diagram has (with display threshold off). If we had used the following displayed labels...

..., which are the ones we used in the Hasse diagram example, then the candidate nodes would have been the following, listed in the same order as their counterparts for the Hasse diagram have been listed above:

The complete graph is tree-shaped (although, due to the red links, it is not always a proper tree). Links between nodes are related to the names of the nodes: if the set of numbers in the name of node A is a strict subset of the set of numbers in the name of node B, then B must be depicted as a descendant of A. As in Hasse diagrams, redundant links are not represented.
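The linking rule can be sketched as follows. The sketch takes a set of displayed nodes, links A to B whenever the name of A is a strict subset of the name of B, and drops every link that can be inferred through an intermediate displayed node. The node names used here are made up for the illustration and are not the ones in the "old" diagram; the function names are ours as well.

```python
def schematic_links(nodes):
    """Return (ancestor, descendant) links between displayed nodes, redundant ones removed."""
    nodes = [frozenset(n) for n in nodes]
    links = []
    for parent in nodes:
        for child in nodes:
            if parent < child:
                # Redundant if some displayed node sits strictly between the two.
                redundant = any(parent < mid < child for mid in nodes)
                if not redundant:
                    links.append((parent, child))
    return links


def fathers_of(node, links):
    """All direct fathers of a node."""
    return {parent for parent, child in links if child == frozenset(node)}


# Hypothetical displayed nodes; the empty set stands for the root.
displayed_nodes = [set(), {1}, {1, 5}, {3}, {4}, {3, 4}]
links = schematic_links(displayed_nodes)

for parent, child in links:
    print(sorted(parent), "->", sorted(child))
print(len(fathers_of({3, 4}, links)))   # 2
```

Node {3, 4} ends up with two fathers, {3} and {4}. In the program, a situation like this is where a red link appears, since only one of the two links fits in a proper tree.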

An important difference between Schematic diagrams and the other two diagram types is that Schematic diagrams inherently rest upon the display threshold principle, which is the principle that candidate pieces of the graph that do not meet a specific condition are not displayed. In the current diagram in the program there were 256 candidate nodes, since there are eight displayed labels, and a set with eight elements has two to the power of eight, that is 256, subsets. So why are only 11 displayed? A lot of information, if not most, in Schematic diagrams is in the presence and absence of nodes.

For example, restricting our attention to reading A for a moment (and shifting again to a less formal level of explanation), what does it mean that both the nodes "1" and "1,5" are displayed in the subtree of this reading, and no others? First of all, it means that in the (filtered) workshop displayed label 1 does co-occur with displayed label 5 in at least one observation. Otherwise the leaf node "1,5" would not be there. Second, it means that in the (filtered) workshop displayed label 1 does not co-occur with any other displayed label but 5. Otherwise node "1" would have other descendants than "1,5". Finally, it means that there are observations that have displayed label 1 and do not have displayed label 5. Otherwise node "1" would not be there and node "1,5" would be attached directly to the root. Of course, the lack of cross links to other readings, and moreover the fact that neither 1 nor 5 occurs anywhere in the names of nodes outside of the reading A subtree, are also informative: these facts show that reading A is clearly isolated from the other readings. Summarizing all this, we conclude that reading A consists of the cluster 1+5, in which 1 seems to be obligatory and 5 seems to be optional, and that the reading does not share any labels with other readings. The rest of the diagram can be interpreted in a similar vein. More technical details about the display threshold are given in 3.3.3 The display threshold.
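The three facts read off the subtree of reading A can also be checked directly on the data. The sketch below does this for a small invented list of observations, each given as the set of numbers of the displayed labels it has. It does not reproduce the program's display threshold, which is explained in 3.3.3; it only restates the three inferences as tests. The function name is ours.

```python
def cluster_facts(observations, core, optional):
    """Check the three facts behind a subtree with nodes 'core' and 'core,optional'."""
    with_core = [set(labels) for labels in observations if core in labels]
    # 1. The two labels co-occur in at least one observation.
    co_occur = any(optional in labels for labels in with_core)
    # 2. The core label co-occurs with no displayed label other than the optional one.
    no_other_partner = all(labels <= {core, optional} for labels in with_core)
    # 3. Some observation has the core label without the optional one.
    core_without_optional = any(optional not in labels for labels in with_core)
    return co_occur, no_other_partner, core_without_optional


candidate_nodes = 2 ** 8     # eight displayed labels give 256 candidate nodes
observations = [{1}, {1, 5}, {1, 5}, {2}, {3, 4}]   # invented sample data

print(candidate_nodes)
print(cluster_facts(observations, core=1, optional=5))
```

When all three tests succeed, as they do for this sample, the data are consistent with a subtree in which node "1" has the single descendant "1,5".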

The second important difference between Schematic diagrams and the other two diagram types is that in Schematic diagrams observations do not necessarily have a unique location. In Schematic diagrams the gray scale of node A reflects the percentage of observations in the (filtered) workshop that are matched by the additional filter that paraphrases A. To take reading A again as an example, it is clear that all observations matched by AND(SEM:having old age, SEM:turned bad), the paraphrase of "1,5", are also matched by SEM:having old age, the paraphrase of "1". The general rule is that in a Schematic diagram the inhabitants of node A by definition also inhabit all ancestors of A. Or in other words, the population of a node is a (specific) subclass of the population of its father node. By extrapolation the root node represents all observations in the (filtered) workshop, but this small fact doesn't become relevant until section 3.3.6 Zooming in on diagram parts.
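The difference with Hasse diagrams shows up clearly when the populations are computed. In the sketch below a node matches every observation that has all the labels in the node's name, whatever other labels the observation has, so the same observation can be counted in several nodes. The observations are invented and are given as sets of displayed-label numbers; the function name is ours.

```python
def node_share(observations, node):
    """Percentage of observations matched by the filter that paraphrases the node."""
    node = frozenset(node)
    if not observations:
        return 0.0
    matched = sum(1 for labels in observations if node <= frozenset(labels))
    return 100.0 * matched / len(observations)


observations = [{1}, {1, 5}, {1, 5}, {2}]    # invented sample data

print(node_share(observations, set()))       # root: 100.0
print(node_share(observations, {1}))         # node "1": 75.0
print(node_share(observations, {1, 5}))      # node "1,5": 50.0
```

The two observations counted for "1,5" are also counted for "1" and for the root, so the share of a node can never exceed that of its father.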

We've come to the end of the presentation of the different diagram types. Before you go to the next section, save the workshop by clicking on the first button in the speed bar (the one with an arrow pointing to a disk) and then close the workshop!

