Abundantia Verborum

3. Tutorial

3.3 Displaying statistics


3.3.3 The display threshold

In 3.3.2 Venn, Hasse and Schematic diagrams we presented the three basic diagram types that Abundantia Verborum provides for displaying statistics about workshops. The display threshold, the topic of this section, is a device for simplifying complex graphs, retaining only the most important information. To set the display threshold is to specify which information is important enough to be displayed. It can be used with all three diagram types, but the mechanism is not equally crucial in each of them. Incidentally, the previous section presented the diagram types in increasing order of their dependence on the display threshold mechanism. Therefore we reverse the order this time, starting with the more important uses of the threshold mechanism.

Display thresholds in Schematic diagrams

In the last part of the previous section 3.3.2 Venn, Hasse and Schematic diagrams we saw that Schematic diagrams are inherently linked to the display threshold mechanism, and we explained how this diagram type behaves with the default threshold settings (which is: threshold on, and set to zero percent). Open the "demowork.wrk" workshop and maximize it! If you followed the instructions of the previous section in detail, the workshop should now redisplay the Schematic diagram we worked with in that section. Current graph settings are saved together with a workshop and restored when the workshop is opened again. This is also true for the settings in the other panels of a workshop window. For instance, the current filter is also saved together with a workshop and restored when the workshop is opened again. Another example is the current display threshold.

The "Display threshold" panel, the middle panel in the left part of the workshop window, should read "Display threshold: 0%". What this means, in the context of Schematic diagrams, is that the condition for admitting a node to the diagram is that its weight is strictly higher than 0%. Now, the concept of the weight of a node is somewhat complex in a Schematic diagram. Rather than being calculated on the basis of its own population, it is calculated on the basis of the population of its counterpart node in the equivalent Hasse diagram. Let us take node "1" as an example. The population of this node is the set of all observations in the (filtered) workshop that are matched by the condition SEM:having old age. This set includes all reading A cases (cf. 3.3.2 Venn, Hasse and Schematic diagrams), including those in node "1,5". However, the existence of node "1" itself is justified not by the frequency of reading A cases as such, but by the frequency of reading A cases not already represented by other nodes, such as node "1,5". The latter frequency can be calculated as the population of node {1}, the Hasse diagram counterpart of node "1".

Let us have a closer look at the current diagram. There are eight displayed labels, the labels of the SEM group. The display threshold is set to 0%. When there are eight displayed labels, there are 256 candidate nodes, since the set {1,2,3,4,5,6,7,8} has 256 subsets. Of those 256 candidate nodes only 11 are displayed. The others have a weight of 0%, which is not strictly higher than the display threshold. For instance, node "1,3,6" does not occur, because the Hasse diagram node "{1,3,6}" has a population of zero percent, or in other words, because none of the inhabitants of the (filtered) workshop have exactly those three out of the eight displayed labels. Actually, there is one exception to the rule. The root node of a Schematic diagram is always displayed, even if the Hasse diagram node {} is insufficiently populated (which is the case in the current diagram).
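The distinction between the population of a Schematic node and the population of its Hasse counterpart can be made concrete with a minimal Python sketch. The four observations are invented for illustration and are not the contents of "demowork.wrk"; the function names are likewise hypothetical and say nothing about how Abundantia Verborum is implemented.

```python
from itertools import combinations

def candidate_nodes(displayed):
    """Every subset of the displayed labels, the empty root included."""
    labels = sorted(displayed)
    return [frozenset(combo)
            for size in range(len(labels) + 1)
            for combo in combinations(labels, size)]

def schematic_population(node, observations):
    """Observations that carry at least the labels of the node."""
    return [obs for obs in observations if node <= obs]

def hasse_population(node, observations, displayed):
    """Observations that carry exactly the labels of the node (displayed labels only)."""
    return [obs for obs in observations if obs & displayed == node]

# Invented data: each observation is the set of displayed labels assigned to it.
DISPLAYED = frozenset(range(1, 9))
OBSERVATIONS = [frozenset({1}), frozenset({1, 5}), frozenset({1, 5}), frozenset({2})]

node_1 = frozenset({1})
print(len(candidate_nodes(DISPLAYED)))                         # 256
print(len(schematic_population(node_1, OBSERVATIONS)))         # 3
print(len(hasse_population(node_1, OBSERVATIONS, DISPLAYED)))  # 1
```

In this invented data, three observations match label 1, but only one of them has label 1 and nothing else, so only that one counts towards the weight of node "1".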

Let us neglect the really rare cases in our workshop and create a diagram that disregards all situations that occur only once. Click on the little triangle pointing to the right in the scroll bar in the "Display threshold" panel! The threshold is now at 1%, which means that a situation has to occur more than 0.32 times (1 percent of all 32 observations). Of course, the diagram does not change. Click on the triangle a second, a third and a fourth time! The threshold is now at 4%, which means that situations occurring only once are disregarded (1 is less than or equal to 1.28). The diagram is reduced to five nodes (apart from the root). Disregarding all phenomena that occur only once, the SEM information for "old" can nicely be classified in four non-overlapping readings. Set the threshold to 7% to further reduce the schema! Only three nodes survive (apart from the root), none of which have more than one label in their name. In other words, SEM label clusters tend to be less populated than isolated labels.
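The arithmetic behind these steps can be checked with a small Python sketch. It assumes, as stated above, that a node survives only if its weight is strictly higher than the threshold; the helper name is made up for this illustration.

```python
def min_count_to_survive(threshold_percent, total):
    """Smallest number of observations whose share of `total` strictly exceeds the threshold.

    Assumes a whole-number threshold, so the comparison stays in integers.
    """
    return threshold_percent * total // 100 + 1

for threshold in (0, 1, 4, 7):
    print(threshold, min_count_to_survive(threshold, 32))
# 0 1
# 1 1
# 4 2
# 7 3
```

With 32 observations, thresholds of 1% to 3% still admit a single observation, which is why the diagram only changes at 4%.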

The weight calculation algorithm for Schematic diagrams we just explained is the default algorithm in Abundantia Verborum, but it is only one of the two types supported by the program. It is the most straightforward but also the rougher variant (hence the epithet "rough" in the title of the "Graph" panel). The finer variant renders a cleaner picture, but requires more complex, time-consuming calculations. Close "demowork.wrk" and open the workshop "schemata.wrk". This is a small, artificial dummy workshop, installed with the program to illustrate the difference between rough and fine calculation of Schematic diagrams. The table below illustrates the structure of the workshop.

OBSERVATIONS     ASSIGNED LABELS
observation 1    dummy:1
observation 2    dummy:1, dummy:2
observation 3    dummy:1, dummy:2
observation 4    dummy:1, dummy:2, dummy:3
observation 5    dummy:1, dummy:2, dummy:3
observation 6    dummy:1, dummy:2, dummy:3

Table 1: the structure of "schemata.wrk"

The difference between rough and fine calculation is that in rough calculation the weight of a node is calculated locally, solely on the basis of features of the node itself, whereas in fine calculation the weight of a node can be influenced by its environment. More precisely, in fine calculation, whenever a node is canceled out by the threshold, its weight adds to the weight of its father(s). If the father(s) too would be canceled out, the weight is further propagated. The propagation algorithm starts at the leaves and progresses towards the root. One final subtlety in the algorithm is that care is taken that extra weight originating from a particular node does not contribute more than once to the weight of a predecessor in case there is more than one path leading from the former to the latter.

In "schemata.wrk", add the group "dummy" to the displayed labels, choose a horizontal Schematic diagram, and make sure the display threshold is on and set to zero percent! The resulting Schematic diagram is a tree that consists of one branch:

[root]---[1]---[1,2]---[1,2,3]

Now increase the threshold to 17%! Since the population of the Hasse node {1} consists of 1 out of 6 observations, which is 16.67% of the workshop, the Schematic node "1" has disappeared. The new diagram looks like:

[root]---[1,2]---[1,2,3]

Now increase the threshold to 34%. Since the population of the Hasse node {1,2} consists of 2 out of 6 observations, which is 33.33% of the workshop, the Schematic node "1,2" has also disappeared. The new diagram looks like:

[root]---[1,2,3]

Now increase the threshold to 50%. Since the population of the Hasse node {1,2,3} consists of 3 out of 6 observations, which is 50% of the workshop, the Schematic node "1,2,3" has also disappeared. The new diagram looks like:

[root]
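The four rough steps above can be reproduced with a minimal Python sketch. It encodes the observations of table 1 and assumes, as explained earlier, that a node is shown only if the population of its Hasse counterpart is strictly higher than the threshold. The function name is hypothetical and the code is not taken from Abundantia Verborum.

```python
from collections import Counter

# The six observations of table 1, each reduced to its set of "dummy" labels.
SCHEMATA = [{1}, {1, 2}, {1, 2}, {1, 2, 3}, {1, 2, 3}, {1, 2, 3}]

def rough_schematic(observations, threshold_percent):
    """Names of the non-root nodes that survive the threshold in rough calculation."""
    total = len(observations)
    # Hasse population: number of observations with exactly this label set.
    hasse = Counter(frozenset(labels) for labels in observations)
    kept = [node for node, count in hasse.items()
            if node and count * 100 > threshold_percent * total]
    kept.sort(key=lambda node: (len(node), sorted(node)))
    return [",".join(str(label) for label in sorted(node)) for node in kept]

for threshold in (0, 17, 34, 50):
    print(threshold, rough_schematic(SCHEMATA, threshold))
# 0 ['1', '1,2', '1,2,3']
# 17 ['1,2', '1,2,3']
# 34 ['1,2,3']
# 50 []
```

The root is left out of the result because it is always displayed, whatever the threshold.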

Now choose the menu command "Options | Preferences..."! The User Preferences dialog box appears. Click on "Load" and load the settings called "finecalc.ini"! The settings specified by this file are exactly the same as the Abundantia Verborum default settings, apart from the fact that it selects the fine algorithm for calculating Schematic diagrams. Close the User Preferences dialog box with "OK"! Nothing happens yet, but all subsequent calculations will use the fine algorithm.

Set the display threshold to zero again! The resulting Schematic diagram, depicted below, is again the one branch tree we started out with above in the rough calculation. The only difference is that this time the title of the "Graph" panel signals that the program uses the "fine" algorithm.

[root]---[1]---[1,2]---[1,2,3]

Now increase the threshold to 17%! Since the population of the Hasse node {1} consists of 1 out of 6 observations, which is 16.67% of the workshop, the Schematic node "1" has disappeared. The new diagram looks like:

[root]---[1,2]---[1,2,3]

The first difference with the rough calculation comes if you increase the threshold to 34%! As expected, the Schematic node "1,2" has disappeared, since the population of the Hasse node {1,2} consists of 2 out of 6 observations, which is 33.33% of the workshop. But at the same time "1" has reappeared. This is because the {1,2} cases have moved up the tree, adding to the weight of "1", which now has a weight of 16.67% + 33.33% = 50%, and therefore survives the threshold. The reasoning behind the algorithm is that the inhabitants of {1,2} are also node "1" cases and, for lack of a node "1,2", can give node "1" a reason for existence. In rough calculation the population of {1,2} is neglected as soon as "1,2" disappears. In fine calculation these cases are 'recovered' at a higher level. In the dummy example, the new diagram looks like:

[root]---[1]---[1,2,3]

Now increase the threshold to 50%! Since the population of the Hasse node {1,2,3} consists of 3 out of 6 observations, which is 50% of the workshop, the Schematic node "1,2,3" has disappeared. At the same time "1,2" has reappeared since the {1,2,3} cases have moved up the tree, adding to the weight of "1,2", which now has a weight of 33.33% + 50% = 83.33%, and therefore survives the threshold. Node "1" finally has disappeared again, since this time it is no longer fed with extra weight, and its own weight of 16.67% is far below the display threshold. The new diagram looks like:

[root]---[1,2]

In fine calculation the picture cannot be reduced to its utter limit. The maximum threshold value is the point where no node fits the threshold in its own right. This is just a limit imposed by the current implementation of Abundantia Verborum. In theory one could continue, working with nothing but cluster nodes, i.e. nodes that take their right of existence from the weight of disappeared descendants. The next step in the example would be a threshold value of 84%. The diagram would be:

[root]---[1]

In this diagram node "1" has a weight of 16.67% + 33.33% + 50% = 100%. Making this last node disappear would take a display threshold of 100%.
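The fine calculation and the threshold steps walked through above can be sketched in Python as follows. This is an illustration under stated assumptions, not the program's code: the fathers of a node are taken to be the closest candidate nodes whose labels form a proper subset of its own, double counting is avoided by propagating sets of observations rather than percentages, and all function names are hypothetical. The sketch also does not enforce the maximum threshold value imposed by the current implementation, so it can show the theoretical 84% step.

```python
# The six observations of table 1, each reduced to its set of "dummy" labels.
SCHEMATA = [{1}, {1, 2}, {1, 2}, {1, 2, 3}, {1, 2, 3}, {1, 2, 3}]

def hasse_support(observations):
    """Map each exact label set to the indices of the observations that carry it."""
    support = {}
    for index, labels in enumerate(observations):
        support.setdefault(frozenset(labels), set()).add(index)
    return support

def fathers(node, candidates):
    """Closest candidate nodes whose labels are a proper subset of `node`."""
    below = [c for c in candidates if c < node]
    closest = [c for c in below if not any(c < other for other in below)]
    return closest or [frozenset()]  # no candidate father: fall back to the root

def fine_schematic(observations, threshold_percent):
    """Names of the non-root nodes that survive the threshold in fine calculation."""
    total = len(observations)
    support = hasse_support(observations)
    root = frozenset()
    support.setdefault(root, set())
    candidates = [node for node in support if node != root]
    displayed = []
    # Larger label sets sit deeper in the diagram, so this visits leaves first.
    for node in sorted(candidates, key=len, reverse=True):
        if len(support[node]) * 100 > threshold_percent * total:
            displayed.append(node)
        else:
            # Canceled node: hand its observations to its father(s).
            # The set union keeps an observation from counting twice.
            for father in fathers(node, candidates):
                support[father] |= support[node]
    displayed.sort(key=lambda node: (len(node), sorted(node)))
    return [",".join(str(label) for label in sorted(node)) for node in displayed]

for threshold in (0, 17, 34, 50, 84):
    print(threshold, fine_schematic(SCHEMATA, threshold))
# 0 ['1', '1,2', '1,2,3']
# 17 ['1,2', '1,2,3']
# 34 ['1', '1,2,3']
# 50 ['1,2']
# 84 ['1']
```

Working with sets of observations instead of summed percentages is one simple way to honour the rule that weight from a particular node never reaches the same predecessor twice, even when several paths lead to it.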

The dummy workshop "schemata.wrk" is an extreme case, maximizing the difference between the two types of calculation. In practice the resulting diagrams will not differ this much, especially if one sticks to modest thresholds (which normally is the case). Nevertheless it is important to understand the conceptual difference between the two, in order to be able to judge which method is most appropriate for one's own case study. To give an example of the interpretation of the difference: applied to the SEM example we used above, fine calculation would allow readings to be bundles of dispersed related cases, whereas rough calculation would demand that readings have a strong nucleus. To conclude this subsection, let us restore the default settings of the program. Choose "Options | Preferences..."! In the User Preferences dialog box click on "Restore Default" and then click on "OK" to close the dialog box! Note that more about the topic "User Preferences" can be found in the online Abundantia Verborum Help.

Display thresholds in Hasse diagrams

We saw that the display threshold can have a drastic impact on Schematic diagrams, especially in fine calculation. Increasing the threshold can make nodes disappear and reappear again and can fundamentally change the overall shape of the diagram. In Hasse diagrams this is different. Here a display threshold behaves like an eraser. Parts of the diagram may disappear, but the rest of the picture remains intact. As in Schematic diagrams, the criterion for a node to stay in is its weight. In Hasse diagrams the weight of a node is a straightforward concept. It is the population of the node, relative to the population of the whole (filtered) workshop. If a node is not depicted, links to that node disappear too.
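The eraser behaviour can be sketched in Python, using the six observations of table 1 in place of real data. The sketch assumes that links connect nodes differing by exactly one label, as in a standard Hasse diagram of subsets; the function name is hypothetical.

```python
from itertools import combinations

# The six observations of table 1, each reduced to its set of "dummy" labels.
SCHEMATA = [{1}, {1, 2}, {1, 2}, {1, 2, 3}, {1, 2, 3}, {1, 2, 3}]

def hasse_diagram(observations, displayed, threshold_percent=None):
    """Return (nodes, links); threshold_percent=None means the threshold is off."""
    labels = sorted(displayed)
    shown = frozenset(labels)
    total = len(observations)
    nodes = [frozenset(combo)
             for size in range(len(labels) + 1)
             for combo in combinations(labels, size)]
    if threshold_percent is not None:
        def count(node):
            # Population: observations with exactly these displayed labels.
            return sum(1 for obs in observations if frozenset(obs) & shown == node)
        nodes = [n for n in nodes if count(n) * 100 > threshold_percent * total]
    # A link survives only if both of its end points are still displayed.
    links = [(a, b) for a in nodes for b in nodes if a < b and len(b) == len(a) + 1]
    return nodes, links

for threshold in (None, 0, 17, 34):
    nodes, links = hasse_diagram(SCHEMATA, {1, 2, 3}, threshold)
    print(threshold, len(nodes), len(links))
# None 8 12
# 0 3 2
# 17 2 1
# 34 1 0
```

Raising the threshold only ever removes nodes and links here; nothing that has been erased comes back, in contrast to fine calculation in Schematic diagrams.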

Close "schemata.wrk" and open "demowork.wrk" again! Open the Graph Settings dialog box, and set the following displayed labels:

COMPAR:POS
COMPAR:COMP
COMPAR:SUP
SAID-OF:machine
SAID-OF:person

Also select Hasse as diagram type and disable the display threshold! Having done all this, click "OK" to look at the resulting diagram.

We combine displayed labels from different groups. This is one technique to look for interesting correlations between parameters (supposing that groups stand for parameters). In general, it is not the easiest technique. More straightforward approaches are discussed in 3.3.5 Filtered diagrams. The currently explained approach has the advantage that it compresses a lot of information in one diagram, but has the drawback that the resulting diagrams are not always easy to interpret.

Since we have 5 displayed labels and the display threshold is off, the diagram contains 2 to the power of 5, which is 32, nodes. Hasse diagrams of this complexity become difficult to take in at a glance. To get a clearer picture, open the Graph Settings dialog box again and enable the display threshold (also make sure its value is set to 0%)! Then click "OK"! The resulting picture, which has 6 out of the original 32 nodes, still contains all the information that was in the previous one. Only the empty nodes have been erased, together with all links to these empty nodes. This illustrates the major function of the threshold in Hasse diagrams: it serves to clear up a picture. Empty nodes are only informative in case you're interested in questions such as: what are all the theoretically possible label combinations, and which of those do not occur? But if you're more interested in what does occur, you may as well leave out the empty nodes.

How can we interpret the resulting diagram? First of all, the absence of {} in combination with the fact that all present nodes have either "1" or "2" in their names tells us that the observations in the workshop are all either COMPAR:POS or COMPAR:COMP. Further you see that both these types show up in three different SAID-OF contexts, namely either SAID-OF:machine (cases {1,4} and {2,4}) or SAID-OF:person (cases {1,5} and {2,5}) or neither one of these two (cases {1} and {2}). Interestingly, both in the person cases and in the neither person nor machine cases positive use is more frequent than comparative use, whereas in the machine cases comparative use is more frequent than positive use. Perhaps people have less difficulty with qualifying machines as "older" than they have with doing the same thing for other categories, especially for people? Whatever the reason for the phenomenon, if we wanted to go into it, the first step would be to corroborate it with data that have more statistical weight. After all, here we're looking at only 32 observations.

As in Schematic diagrams, we can also use the threshold for canceling out the really rare cases. Set the threshold to 5%! This gets rid of all nodes with only one inhabitant (1/32 = 3.125%, which is less than or equal to 5%). Now only four nodes remain. From this diagram we can read things like: disregarding extremely rare phenomena (at or below 5%) the only significant use of "old", when applied to machines, is comparative use.

Display thresholds in Venn diagrams

Click on the "Graph" panel with your RIGHT mouse button! A popup menu appears, listing the different diagram types. Choose "Venn Diagram"! This is the fastest way to switch between diagram types, and it is very practical if you want to change nothing but the diagram type. Currently the shortcut doesn't help us much, because we are going to change some other graph settings as well. Our Venn diagram is incomplete, since there are five displayed labels, so let us choose a simpler diagram. Open the Graph Settings dialog box! Click on COMPAR:COMP in the list of displayed labels, and then click on "Delete Selection" to remove this element from the list! Then click on "Delete Selection" again to remove COMPAR:SUP from the list! Also make sure the display threshold is enabled and set to 0%. Finally click "OK"!

The resulting diagram has only three displayed labels, but since we already learnt from the data that in the current workshop not having COMPAR:POS implies having COMPAR:COMP (and vice versa), the diagram still contains all the information that was in the five-label Hasse diagram we used above. We invite the reader to interpret the diagram, while admitting that interpreting diagrams with heterogeneous labels is not straightforward.

The function of the display threshold in Venn diagrams is still more modest than in Hasse diagrams. It does not make the picture clearer (Why should it? Big Venn diagrams don't become complex, they rather become incomplete). It merely blanks out regions that, according to their weight and the threshold value, are judged to be insufficiently significant. The weight of a region is its population, relative to the population of the whole (filtered) workshop. Just increase the threshold value and see what happens!

