Abundantia Verborum

3. Tutorial

3.2 Classifying the data


3.2.3 Adding labels via filters

In 3.2.2 Adding labels via queries we saw a technique for automatically adding labels to observations at the point of their creation, during a query run. In this section we see a first technique for adding labels to already existing observations. We proceed with the workshop "labels.wrk", which you created at the end of 3.2.2 Adding labels via queries. If you just finished the instructions of that section then the workshop "labels.wrk" is currently active on the Abundantia Verborum desktop. If this is not the case then close all windows that are currently open on the Abundantia Verborum desktop, if any, and then open the workshop "labels.wrk"!

The appearance of workshop windows

Before we start doing some real work from within workshops, let us get familiar with the overall appearance and behaviour of workshop windows. Like all windows, they have three states as far as their size is concerned: they can be normal, minimized or maximized. At this moment "labels.wrk" is probably displayed as a normal window. Click on its control box, which is in its top left-hand corner (cf. the text on "Saving the results" in 3.1.2 A First Query)! A menu appears. From this menu we already saw the menu item "Close" for closing the window. Now we have a closer look at the items "Restore", "Minimize" and "Maximize". The item "Restore" is probably disabled at the moment. Restoring a window means bringing a maximized or minimized window back to its normal state. This option is meaningless, and therefore disabled, when the window is already in its normal state. Experiment with these menu items! You may have some trouble locating the control box of a minimized or maximized window. When a workshop window is maximized, most parts of its top edge move to the Abundantia Verborum menu bar, and the control box can be found to the left of the "File" menu. The reason for this dense representation is that the purpose of maximization is to give as much screen space as possible to the body of the maximized window. When a workshop window is minimized, i.e. reduced to almost nothing but an icon, its control box is the icon.

The control box menu is actually not always the fastest way to work. We explain this technique because it is the only one that is common to all Windows versions. Depending on your Windows version, however, you have several shortcut buttons. In Windows 3.X user interfaces there is a button in the top right-hand corner for each of the two states other than the current one. In Windows 95 user interfaces the same buttons exist, but there is also a third one to their right for closing the window.

This is not the end of the story. There also are techniques for moving and resizing a window in normal state. Furthermore there is the possibility of having more than one window on the Abundantia Verborum desktop at the same time. In this case one window is the active or current window and the others are inactive. The active window is always in front, possibly covering others partially or even completely. For this situation there are several functions, most of them in the "Window" menu, for managing and for navigating between these windows. Further information can be found in the online Abundantia Verborum Help, by searching for help on "Window menu". Most if not all of this behaviour is the same in many other MS-Windows applications and in MS-Windows itself and is also documented in MS-Windows manuals.

Using filters

Now that you know how it is done, maximize the workshop "labels.wrk". If you still don't get a clear overview of the workshop window, then resize or maximize the Abundantia Verborum main window. Now you see clearly that a workshop window consists of four panels, labeled "Workshop filter", "Display threshold", "Displayed labels" and "Graph".

In the "Workshop filter" panel you read (none), which indicates that no filter is currently active. Below this text you read "n observations in filter" and below that "n observations in workshop", with n being some integer. The last line indicates how many observations there are in your workshop. The other one indicates how many of these observations are matched by the filter, i.e. are NOT filtered out by the current filter. In this case, since no filter was set, the numbers are identical.

The filter is a mechanism for focussing on part of your workshop, pretending for a moment that the other part does not exist. When some filter is set, practically all workshop functions completely neglect the observations that are not matched by this filter and behave as if these observations simply do not exist. There are a few exceptions to this rule: for instance, when you save a workshop you also save the observations that are currently filtered out. These exceptions are necessary so that the filtered-out observations are not actually lost. An important function that, like most, does neglect all observations that have been filtered out is "Workshop | Browse Observations...". If you choose it with a filter set, the Observation Browser pops up showing only those observations that are matched by the filter.

Label based filters

Choose "Workshop | Set Filter..." (the speedbar equivalent of which is the icon that is supposed to look like a coffee filter)! The Set Workshop Filter dialog box appears. This environment is the filter editor. You will notice that it is very similar to the Query Editor. Just like queries, filters too are represented as trees. And filters too can contain Boolean operators. Filters too are built from the root upwards with "create THIS OR THAT node" and "add THIS OR THAT son" instructions and are removed again with "delete node" or "delete subtree" instructions. The difference is in the leaves of the tree, which in filters are either label nodes or query nodes. Choose "create Label node"! The Select Label dialog box appears. At the left you see the label groups of the workshop. Currently there is one group: <query>. At the right you see the labels in the currently selected group. If you have followed the instructions in 3.2.2 Adding labels via queries there are three labels. The first label, <1>, is selected and this is exactly what we want, so simply click "OK". We now have a filter tree, trivial but nevertheless a tree, consisting of one node, namely <query>:<1>. In the Set Workshop Filter dialog box, click "OK" to accept this filter as the new current filter of the workshop! You can see that the Workshop filter panel of the workshop window now reads <query>:<1> instead of (none). You also see that "observations in filter", the number of observations matched by the filter, is now strictly smaller than "observations in workshop". The filter <query>:<1> is a so-called label based filter. Being matched by such filters depends on having or not having some label or some label configuration. In order to be matched by the filter <query>:<1> an observation must have the label <query>:<1>. In other words, it must have been added to "labels.wrk" when the query WORD(old) was run.
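For readers who like to see the logic spelled out, the way a label based filter tree selects observations can be sketched in a few lines of Python. This is only an illustration and not part of Abundantia Verborum: the tuple encoding of the tree, the function name `matches` and the four sample observations are all our own assumptions.

```python
# Hypothetical model: an observation is represented by the set of labels it carries.
# A filter is a nested tuple:
#   ("LABEL", name), ("NOT", f), ("AND", f, g, ...) or ("OR", f, g, ...).

def matches(filter_tree, labels):
    """Return True if an observation carrying `labels` is matched by the filter."""
    if filter_tree is None:  # no filter set, displayed as (none): everything matches
        return True
    op, *args = filter_tree
    if op == "LABEL":
        return args[0] in labels
    if op == "NOT":
        return not matches(args[0], labels)
    if op == "AND":
        return all(matches(arg, labels) for arg in args)
    if op == "OR":
        return any(matches(arg, labels) for arg in args)
    raise ValueError(f"unknown node type: {op}")

# Invented workshop of four observations
workshop = [
    {"<query>:<1>"},
    {"<query>:<1>"},
    {"<query>:<2>"},
    {"<query>:<2>"},
]

current_filter = ("LABEL", "<query>:<1>")
in_filter = [labels for labels in workshop if matches(current_filter, labels)]
print(f"{len(in_filter)} observations in filter")
print(f"{len(workshop)} observations in workshop")
```

A compound filter is simply a deeper tuple, e.g. `("OR", ("LABEL", "<query>:<1>"), ("LABEL", "<query>:<2>"))`.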

Assigning a new label to all filtered observations

Choose "Workshop | Browse Observations..."! The Observation Browser opens, containing all observations matched by the current filter. At the bottom of the dialog box a message in the style of "23 items (100% of filter; 76.67% of workshop)" reminds you of the fact that 23.33% of the workshop is currently invisible because it is not matched by the filter. The observations in the list have numbers like "<0001>", "<0002>", etc. Note that these are not absolute IDs of observations. They merely number the observations in the order in which they are currently listed in the browser. So, if the browser contains 23 items, another 7 being filtered out, the numbers go from "<0001>" to "<0023>", regardless of where the items occur in the complete list of 30 observations.
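The arithmetic behind the status message and the browser numbering is easy to reproduce. The following Python sketch is only an illustration under our own assumptions (the helper names `pct` and `browser_status` are invented, and the program may round differently); it uses the figures of 23 visible observations out of 30.

```python
def pct(part, whole):
    """Format part/whole as a percentage with at most two decimals."""
    text = f"{100 * part / whole:.2f}"
    return text.rstrip("0").rstrip(".") + "%"

def browser_status(in_filter, in_workshop):
    """Status line in the style shown at the bottom of the Observation Browser."""
    return (f"{in_filter} items ({pct(in_filter, in_filter)} of filter; "
            f"{pct(in_filter, in_workshop)} of workshop)")

status = browser_status(23, 30)
print(status)

# The browser numbers are positions in the visible list, not absolute IDs
numbers = [f"<{position:04d}>" for position in range(1, 23 + 1)]
print(numbers[0], "...", numbers[-1])
```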

Click on "Tag All", a function for adding a particular label to all observations currently visible in the Observation Browser! The Add Label to Observations dialog box appears. Instead of selecting an already existing label we are going to create a new one. Click on "Append New Group"! In the New Label Group dialog box, key in the name COMPAR and the description "comparison", and then click "OK". Now click on "Append New Label"! In the New Label dialog box, key in the name POS and the description "positive", and then click "OK". To tag all visible observations with this newly created label, click on the button "Add Label to Observations"! You're back in the Observation Browser now and the label should have been added to the observations. To test whether your action has indeed had any effect, double click observation number 1 in the list. In the bottom part of the Observation Editor you see that this observation currently has two labels, namely <query>:<1> and COMPAR:POS, which is good news. Instead of manually testing whether all other visible observations were tagged too, you can let the filter mechanism do the testing. Close the Observation Editor with "OK" and close the Observation Browser with "Close"! Now set the filter to AND(<query>:<1>,COMPAR:POS) so that you see only those observations that have both labels and afterwards to either NOT(OR(<query>:<1>,COMPAR:POS)) or AND(NOT(<query>:<1>),NOT(COMPAR:POS)) so that you see only those observations that have neither one of the labels. Each time verify what the Workshop filter panel reports about number of "observations in filter" and ask yourself whether the result is what it should be! Note that when selecting a label in the Select Label dialog box (and also in a few other dialog boxes, for that matter) you first have to select, i.e. click on, the correct group before you can select the label.
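That the two "neither label" filters select the same observations follows from De Morgan's law. A small Python check makes this concrete. It is offered only as an illustration: the function names and the five-observation sample workshop are invented for the purpose.

```python
from itertools import product

Q1 = "<query>:<1>"
POS = "COMPAR:POS"

def not_or(labels):
    """NOT(OR(<query>:<1>,COMPAR:POS))"""
    return not (Q1 in labels or POS in labels)

def and_not(labels):
    """AND(NOT(<query>:<1>),NOT(COMPAR:POS))"""
    return (Q1 not in labels) and (POS not in labels)

# Every combination of having or not having each of the two labels
combinations = [
    {label for label, present in zip((Q1, POS), flags) if present}
    for flags in product([False, True], repeat=2)
]
equivalent = all(not_or(labels) == and_not(labels) for labels in combinations)
print("filters equivalent:", equivalent)

# Invented workshop: three observations tagged via <query>:<1>, two others
workshop = [{Q1, POS}, {Q1, POS}, {Q1, POS}, {"<query>:<2>"}, {"<query>:<2>"}]
both = sum(1 for labels in workshop if Q1 in labels and POS in labels)
neither = sum(1 for labels in workshop if not_or(labels))
print(both, "with both labels;", neither, "with neither")
```

If "Tag All" has done its work, the two counts add up to the number of observations in the workshop, because no observation carries only one of the two labels.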

Now that you know the principle, set the filter to <query>:<2> and tag all observations matched by this filter with a newly created label named COMPAR:COMP. The only stumbling block may be creating the new label. You should know that in the Add Label to Observations dialog box, completely analogous to what has been said in the previous paragraph about the Select Label dialog box, you first have to select the correct group before clicking on "Append New Label". New labels are appended to the currently selected group.

Although no observations were matched by the third query <query>:<3>, and therefore there are no observations to add a label COMPAR:SUP to, we are nevertheless going to create this label, to make explicit that this too is a possible value we would have assigned to observations if appropriate, even though this turned out never to be the case in the workshop at hand. Up to this point we have been creating labels on the fly, when we needed them, right before we actually used them. It is also possible to create a label now and use it later, or, as in our example, create it now and never use it at all. In order to create COMPAR:SUP, close all dialog boxes, if any are open, and choose "Workshop | Browse Labels..." (or click on the first of the two little bottles in the speedbar)! This brings you to the Label Browser, the environment with the fullest access to the pool of labels in the current workshop. From the Add Label to Observations dialog box, or from the similar Set Observation's Labels dialog box, which will be described in 3.2.5 Manually adding labels, you have some limited control over your pool of labels: e.g. you can change their name and description, and you can create new labels. In the Label Browser you have full control over your labels. Apart from creating and editing, you can also delete labels or create implication relations between labels. Implication relations are discussed in 3.2.6 Label taxonomies. Full control over the pool of labels is not given in the two dialog boxes just mentioned because they are accessed from within the Observation Browser. Having full control over the labels when the Observation Browser is open would imply that you, the user, could easily mess up the current environmental settings of the Observation Browser. You could for instance delete a label that is used in the current filter, thereby making this filter "un-wellformed" and making the current contents of the Observation Browser "undefined". This is why the program only grants you full access to your pool of labels if the Observation Browser is closed.

To summarize what we did so far, we used label based filters, in combination with the "Tag All" command of the Observation Browser, to achieve what was the topic of the end of 3.2.2 Adding labels via queries, namely "Using the right names". Back then the information 'adjective in positive form', 'adjective in comparative form' and 'adjective in superlative form' was implicitly present in the labels <query>:<1>, <query>:<2> and <query>:<3> respectively. Now we have made this information explicit by creating the labels COMPAR:POS, COMPAR:COMP and COMPAR:SUP and assigning them to the appropriate observations. The technique we have used, the technique of adding labels via filters, is a general mechanism for making implicit information explicit. Up to this point filters have been simple, but this need not be the case. For instance, we could have used four queries to collect the data, splitting up the search for the superlative cases into WORD(oldest) as <query>:<3> and WORD(eldest) as <query>:<4>. In that case, if our corpus had been a bit bigger than it is and if some matches to these queries had turned up, we probably would have added the label COMPAR:SUP via the compound filter OR(<query>:<3>,<query>:<4>). In practice filters will be much more complex still.

Query based filters

We now know how to assign a new label to all observations that have certain labels and/or lack certain other labels. In the remainder of this section we see a second type of condition a filter can be based on: queries. Using a query based filter means filtering observations on the basis of features of their "Contents" or "Origin" field, instead of on the basis of which labels they have.

Running queries on huge corpora can be very time intensive. Therefore the technique of collecting data on the basis of a whole series of queries and then adding other labels on the basis of the distribution of the query labels is not always the best conceivable approach if the corpora are large. An alternative is to be rough at the moment of data collection, and refine afterwards. First you collect the data with one or at most a few queries. These queries may over-generate; that is, they may accept data we are not interested in. Or they may not make distinctions we do want to make later on. Throwing away the spurious hits and/or making further distinctions is then done from within the workshop, with a technique you could roughly describe as 'running queries on workshops'.

Let us look at yet another approach for collecting data for studying the word "old". Close all windows on the Abundantia Verborum desktop and create a new workshop! Choose "Workshop | Run Query..."! Create a query WORD(RE([eo]ld.*)), save it as "over_gen.que" and select it as the query to be run. We already know that the special symbol sequence ".*" in a regular expression stands for: anything (cf. 3.1.4 Queries with Wildcards). New is that square brackets in a regular expression also are special symbols. They list alternative characters. The complete query can be paraphrased as "any word with 'e' or 'o' in the first position, 'l' in the second position, 'd' in the third position and anything or nothing at all after that". Clearly this query has the risk of over-generating. The words "eldership" and "oldestablished" are only two of the examples that would be matched and that fall beyond the subject of our study. Run the query on "democorp.vic", in 'keep all' mode, and save the workshop as "over_gen.wrk"! Choose "Workshop | Sort Observations | By Contents or Origin..."! In the dialog box that pops up, make sure the "Contents" field is checked and after "First occurrence of..." key in "<MATCH>" (in uppercase)! Click "OK"! You just sorted the observations in the current workshop by their "Contents" field, with the following extra specification: if the "Contents" field of an observation contains the string "<MATCH>", the part of the "Contents" field that comes before the first occurrence of this string is disregarded by the sort algorithm. If the "Contents" field of an observation does not contain the string "<MATCH>", the complete "Contents" field is taken into account. The net result of your sort instruction is that the observations are sorted by hit.
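Both the regular expression and the sort instruction can be imitated in a few lines of Python. The sketch below is only an illustration, under the assumption that Python's regular expression syntax behaves like the program's for this particular pattern. The word list, the sample "Contents" fields and the function name `sort_key` are invented.

```python
import re

# Assumption: Python's regex syntax is close enough to the program's for this pattern.
pattern = re.compile(r"[eo]ld.*")

words = ["old", "older", "elder", "elderly", "eldership",
         "oldestablished", "bold", "hold", "led"]
# fullmatch imitates the WORD operator: the whole word must fit the pattern
hits = [word for word in words if pattern.fullmatch(word)]
print(hits)

def sort_key(contents, marker="<MATCH>"):
    """Disregard everything before the first occurrence of `marker`, if any."""
    position = contents.find(marker)
    return contents if position == -1 else contents[position:]

# Invented "Contents" fields; the hit is wrapped in <MATCH>...</MATCH>
observations = [
    "the <MATCH>older</MATCH> of the two",
    "an <MATCH>elder</MATCH> statesman",
    "a very <MATCH>old</MATCH> house",
    "for the <MATCH>elderly</MATCH> only",
]
by_hit = sorted(observations, key=sort_key)
for contents in by_hit:
    print(contents)
```

Because the text before the marker is ignored, the observations end up grouped by hit rather than by whatever word happens to open the "Contents" field.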

If you open the Observation Browser now, you have a clear overview of the different types of hits. Press the <Page Down> key a few times to go through the observations. When you've reached the end of the list, press the <Home> key to return to the top of the list. You see the program has found 40 hits, namely 2 occurrences of "elder", 8 occurrences of "elderly", 23 occurrences of "old" and finally 7 occurrences of "older". Hits we did not have in former queries and do have now are "elder" and "elderly". We leave open whether they should be included in the study of "old". But they certainly illustrate that broad, possibly over-generating queries have the benefit of occasionally revealing data that should be included in one's study and that one may not have thought of in advance.

After collecting all our data in one rough step, we can now classify the hits with query based filters. First open the Label Browser and create the label group COMPAR with the labels POS, COMP and SUP (we leave the description fields to your inspiration)! Close the Label Browser! Next open the Set Workshop Filter dialog box again, but this time use yet another method: click on the "Workshop filter" panel of the workshop window! There we are! Choose "create Query node"! In the Add Query Node dialog box click "Edit"! You end up in the Query Editor, the same one we used before for constructing the queries we've been running on corpora. In the Query Editor choose "create RE node"! In the RE Editor key in "er</MATCH>" and click "OK"! Save the query as "hit_comp.que"! Recall from section 3.1.4 Queries with wildcards that regular expressions can be used outside of WORD-operators, and that the effect is that the search engine disregards word boundaries when looking for a match. We do not use a WORD-operator here because we want the program to look for tags, not words: we want it to look for the tag </MATCH>, more precisely, the tag </MATCH> immediately preceded by "er". The idea behind the query "hit_comp.que" is that in the workshop "over_gen.wrk", in which the query "over_gen.que" was used for data collection, "er" at the end of a hit is a good indicator for comparative use of the adjective "old". Note that the regular expression in this query does not contain any special symbols. Regular expressions without special symbols simply function as conventional find instructions.

Having saved the query, choose "OK" in the Query Editor to select the query as current query in the Add Query Node dialog box. Back in the Add Query Node dialog box you see that "er</MATCH>" appears in the top panel and that underneath this panel, to the right of the three buttons, "c:\abundant\user\hit_comp.que" is displayed as name of the query. Last but not least, make sure that "Contents" is checked in the bottom panel, since we want to formulate a condition for the "Contents" field of observations, not the "Origin" field. If that is OK, click "OK" to create the query node and add it to the filter that is under construction! Back in the Set Workshop Filter dialog box the new query node looks like CONTENTS(c:\abundant\user\hit_comp.que). This node represents a condition which is met by an observation if the query engine finds at least one hit for the query at hand in the "Contents" field of the observation, and which is not met otherwise. A technical detail: when a query is run on a "Contents" or "Origin" field of an observation, the complete field is taken to be one constituent. Select CONTENTS(c:\abundant\user\hit_comp.que) as new current filter by clicking "OK" in the Set Workshop Filter dialog box. Verify in the "Workshop filter" panel that the filter has taken effect, by looking at the "observations in filter" line! There should be 9 observations in the filter. These are of course the 2 "elder" and the 7 "older" cases. Now open the Observation Browser! You see that only the "elder" and "older" cases are present. Click "Tag All", select the label COMPAR:COMP, and assign it to all visible observations!
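The effect of this query based filter can be mimicked in Python. The sketch below is only an illustration and not the program's implementation. The hit counts are the ones reported above for "over_gen.wrk", but the surrounding context in each "Contents" field, the dictionary layout and the function name `contents_filter` are our own assumptions.

```python
import re

# Hit counts reported in the tutorial for "over_gen.wrk"
hit_counts = {"elder": 2, "elderly": 8, "old": 23, "older": 7}

# Invented "Contents" fields: some context with the hit wrapped in <MATCH>...</MATCH>
workshop = [
    {"contents": f"some context <MATCH>{word}</MATCH> more context", "labels": set()}
    for word, count in hit_counts.items()
    for _ in range(count)
]

# The query in "hit_comp.que": a regular expression without special symbols,
# so it behaves like a plain substring search
hit_comp = re.compile("er</MATCH>")

def contents_filter(query, observation):
    """CONTENTS(query): matched if the query finds at least one hit in "Contents"."""
    return query.search(observation["contents"]) is not None

in_filter = [obs for obs in workshop if contents_filter(hit_comp, obs)]
print(len(in_filter), "observations in filter")
print(len(workshop), "observations in workshop")

# "Tag All" with COMPAR:COMP
for obs in in_filter:
    obs["labels"].add("COMPAR:COMP")
```

Note that "elderly" is not matched: its "er" is followed by "ly", not by the closing tag.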

Since you have acquired an overview of your observations by sorting them and then going over their list in the Observation Browser, you know that there are no superlatives present. Therefore you can now set the filter to NOT(COMPAR:COMP) and from the Observation Browser tag all visible observations with the label COMPAR:POS. Even if there are no candidates for being assigned the label COMPAR:SUP, we nevertheless keep it in our pool of labels as a possible value of COMPAR. At every point we must make the effort of making information explicit: of slowly repeating, so to speak, what we ourselves see immediately, so that in the end the program has understood it too. The reason for this rigour is simple. At a later stage in the study, when there will be too many parameters for the human to oversee at once, the only information the program will take into account in its calculations is the explicit information.

To conclude our discussion of starting with over-generating raw queries and then compensating with refining query based filters, let us suppose that we decide not to include the occurrences of "elderly" in our study of "old". In other words, we decide that the "elderly" cases should be thrown out as spurious hits. This too we do with a query based filter. Click on the "Workshop filter" panel to open the Set Workshop Filter dialog box! If, as probably is the case, the current filter is not empty, first clear it by clicking on the root node and choosing "delete node" or "delete subtree"! Next choose "add Query node"! In the Add Query Node dialog box clear the top panel by clicking on "Clear"! Next click on "Edit"! In the Query Editor create the query "RE(ly</MATCH>)" and save it as "hit_ly.que"! Leave the Query Editor by clicking "OK"! Back in the Add Query Node dialog box make sure "Contents" is checked and click "OK"! Finally, in the Set Workshop Filter dialog box accept CONTENTS(c:\abundant\user\hit_ly.que) as new current filter!

Now only the matches that end with "ly" are visible. Open the Observation Browser and click on "Clear All"! This function permanently destroys all observations that are currently visible. Since throwing out the wrong observations is easily done and highly regrettable, the program tries to protect you from it and asks you to confirm that you want to go through with this, as the program puts it, 'highly destructive' operation. Click "Yes"! Close the Observation Browser, clear the filter and, last but not least, save the workshop!
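The difference between hiding observations with a filter and destroying them with "Clear All" can be made explicit in a short Python sketch. It is only an illustration: the hit counts are those reported earlier for "over_gen.wrk", while the context strings and the function name `matched_by_hit_ly` are invented.

```python
# Hit counts reported in the tutorial for "over_gen.wrk"
hit_counts = {"elder": 2, "elderly": 8, "old": 23, "older": 7}

# Invented "Contents" fields with the hit wrapped in <MATCH>...</MATCH>
workshop = [
    f"some context <MATCH>{word}</MATCH> more context"
    for word, count in hit_counts.items()
    for _ in range(count)
]

def matched_by_hit_ly(contents):
    """CONTENTS(hit_ly.que): the hit ends in "ly"."""
    return "ly</MATCH>" in contents

# Setting the filter only hides observations; the workshop itself is unchanged
visible = [contents for contents in workshop if matched_by_hit_ly(contents)]
size_before = len(workshop)

# "Clear All" removes the visible observations from the workshop itself
workshop = [contents for contents in workshop if not matched_by_hit_ly(contents)]
size_after = len(workshop)

print(len(visible), "observations cleared;", size_after, "of", size_before, "remain")
```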

Mixed filters

We conclude this section by mentioning that the two bases for filters are not mutually exclusive. They can be combined in compound filters. Moreover complex filters can be constructed by combining stored queries.

