In 3.2.2 Adding labels via queries we saw a technique for automatically adding labels to observations at the point of their creation, during a query run. In this section we see a first technique for adding labels to already existing observations. We proceed with the workshop "labels.wrk", which you created at the end of 3.2.2 Adding labels via queries. If you just finished the instructions of that section then the workshop "labels.wrk" is currently active on the Abundantia Verborum desktop. If this is not the case then close all windows that are currently open on the Abundantia Verborum desktop, if any, and then open the workshop "labels.wrk"!
Before we start doing some real work from within workshops, let us get familiar with the overall appearance and behaviour of workshop windows. Like all windows, they have three states as far as their size is concerned: normal, minimized or maximized. At this moment "labels.wrk" is probably displayed as a normal window. Click on its control box, which is in its top left-hand corner (cf. the text on "Saving the results" in 3.1.2 A First Query)! A menu appears. From this menu we already know the item "Close" for closing the window. Now we have a closer look at the items "Restore", "Minimize" and "Maximize". The item "Restore" is probably disabled at the moment. Restoring a window means bringing a maximized or minimized window back to its normal state. This option is meaningless, and therefore disabled, when the window is already in its normal state. Experiment with these menu items! You may have some trouble locating the control box of a minimized or maximized window. When a workshop window is maximized, most parts of its top edge move to the Abundantia Verborum menu bar, and the control box can be found to the left of the "File" menu. The reason for this dense representation is that the purpose of maximization is to give as much screen space as possible to the body of the maximized window. When a workshop window is minimized, i.e. reduced to almost nothing but an icon, its control box is the icon.
The control box menu is not always the fastest way to work. We explain this technique because it is the only one that is common to all Windows versions. Depending on your Windows version, however, you have several shortcut buttons. In Windows 3.X user interfaces the top right-hand corner holds a button for each of the two states other than the current one. In Windows 95 user interfaces the same buttons exist, with a third one to their right for closing the window.
This is not the end of the story. There also are techniques for moving and resizing a window in normal state. Furthermore there is the possibility of having more than one window on the Abundantia Verborum desktop at the same time. In this case one window is the active or current window and the others are inactive. The active window is always in front, possibly covering others partially or even completely. For this situation there are several functions, most of them in the "Window" menu, for managing and for navigating between these windows. Further information can be found in the online Abundantia Verborum Help, by searching for help on "Window menu". Most if not all of this behaviour is the same in many other MS-Windows applications and in MS-Windows itself and is also documented in MS-Windows manuals.
Now that you know how it is done, maximize the workshop "labels.wrk". If you still don't get a clear overview of the workshop window, then resize or maximize the Abundantia Verborum main window. Now you see clearly that a workshop window consists of four panels, labeled "Workshop filter", "Display threshold", "Displayed labels" and "Graph".
In the "Workshop filter" panel you read (none), which indicates that no filter is currently active. Below this text you read "n observations in filter" and below that "n observations in workshop", with n being some integer. The last line indicates how many observations there are in your workshop. The other one indicates how many of these observations are matched by the filter, i.e. are NOT filtered out by it. In this case, since no filter is set, the numbers are identical.
The filter is a mechanism for focussing on part of your workshop, pretending for a moment that the other part does not exist. When a filter is set, practically all workshop functions completely ignore the observations that are not matched by this filter and behave as if these observations simply did not exist. There are a few exceptions to this rule: for instance, when you save a workshop you also save the observations that are currently filtered out. These exceptions ensure that filtered-out observations are never actually lost. An important function that, like most, does ignore all observations that have been filtered out is "Workshop | Browse Observations...". If you choose it with a filter set, the Observation Browser pops up showing only those observations that are matched by the filter.
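The behaviour of the filter can be pictured with a small sketch. The following Python fragment is not part of Abundantia Verborum; the names Observation, Workshop, visible and saved are made up for illustration. It only shows the idea that the filter restricts what most functions see, while saving still covers every observation.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Observation:
    contents: str
    labels: set = field(default_factory=set)


@dataclass
class Workshop:
    observations: list
    # None plays the role of "(none)" in the Workshop filter panel.
    current_filter: Optional[Callable[[Observation], bool]] = None

    def visible(self):
        """Observations matched by the current filter (all of them if none is set)."""
        if self.current_filter is None:
            return list(self.observations)
        return [o for o in self.observations if self.current_filter(o)]

    def saved(self):
        """What a save would write: every observation, filtered out or not."""
        return list(self.observations)

    def status(self):
        return (f"{len(self.visible())} observations in filter",
                f"{len(self.observations)} observations in workshop")


ws = Workshop([
    Observation("an old house", {"<query>:<1>"}),
    Observation("an older house", {"<query>:<2>"}),
    Observation("the old man", {"<query>:<1>"}),
])
status_no_filter = ws.status()

# Setting a filter changes what is visible, not what is stored.
ws.current_filter = lambda o: "<query>:<1>" in o.labels
status_filtered = ws.status()

print(status_no_filter)
print(status_filtered)
print(len(ws.saved()), "observations would be saved")
```

Without a filter both counts are equal; with the filter set, only the first count drops.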
Choose "Workshop | Set Filter..." (the speedbar equivalent of which
is the icon that is supposed to look like a coffee filter)! The Set
Workshop Filter
dialog box appears. This environment is the filter editor.
You will notice that it is very similar to the Query Editor.
Just like queries, filters too are represented as trees. And filters too
can contain
Boolean operators. Filters too are built from the root upwards with
"create THIS OR THAT node" and "add THIS OR THAT son" instructions and
are removed
again with "delete node" or "delete subtree" instructions. The difference
is in the leaves of the tree, which in filters are either label nodes
or query nodes. Choose "create Label node"! The Select Label
dialog box appears. At the left you see the label groups
of the workshop. Currently there is one group: <query>.
At the right you see the labels in the currently selected
group. If you have followed the instructions in
3.2.2 Adding labels via queries there
are three labels.
The first label, <1>, is selected and this is exactly
what we want, so simply click "OK". We now have a filter tree, trivial but
nevertheless a tree, consisting of one node, namely
<query>:<1>. In the
Set Workshop Filter dialog box, click "OK" to accept this
filter as the new current filter of the workshop! You can
see that the Workshop filter panel of the workshop
window now reads <query>:<1> instead
of (none). You also see that "observations in filter",
the number of observations matched by the filter, is now
strictly smaller than "observations in workshop".
The filter <query>:<1> is a so-called
label based filter. Being matched by such filters depends on
having or not having some label or some label configuration.
In order to be matched by
the filter <query>:<1> an
observation must have the label
<query>:<1>. In other words, it must
have been added to "labels.wrk" when the query WORD(old)
was run.
Choose "Workshop | Browse Observations..."! The Observation Browser
opens containing all observations matched by the current filter.
At the bottom of the dialog box a message in the style of
"23 items (100% of filter; 76.67% of workshop)"
reminds you of the fact that 23.33% of the workshop is currently
invisible because it is not matched by the filter. The observations
in the list have numbers like "<0001>", "<0002>", etc.
Note that these are not absolute IDs of observations. They
order the observations as they are currently listed in the browser.
So, if the browser contains 23 items, another 7 being filtered out,
the numbers go from "<0001>" to "<0023>", regardless of
where the items occur in the complete list of 30 observations.
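The relative numbering and the percentages in the status message can be reproduced with a few lines of Python. This is only a sketch: the function name browser_status and the way the percentages are rounded are assumptions, chosen so that the output resembles the message quoted above.

```python
def browser_status(n_visible, n_filter, n_workshop):
    """Build a status line in the style of the Observation Browser message."""
    pct_filter = round(100 * n_visible / n_filter, 2)
    pct_workshop = round(100 * n_visible / n_workshop, 2)
    return f"{n_visible} items ({pct_filter:g}% of filter; {pct_workshop:g}% of workshop)"


# 30 made-up observations, 7 of which are filtered out (positions 3, 7, 11, ...).
observations = [f"observation {i}" for i in range(30)]
visible = [o for i, o in enumerate(observations) if i % 4 != 3]

# The browser numbers are positions in the visible list, not absolute IDs.
numbered = [(f"<{n:04d}>", o) for n, o in enumerate(visible, start=1)]
status = browser_status(len(visible), len(visible), len(observations))

print(numbered[3])   # fourth visible item, which is the fifth observation overall
print(numbered[-1])
print(status)
```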
Click on "Tag All", a function for adding a particular label to
all observations currently visible in the Observation Browser!
The Add Label to Observations dialog box appears. Instead of
selecting an already existing
label we are going to create a new one. Click on "Append New Group"!
In the New Label Group dialog box, key in the name
COMPAR and the description "comparison", and
then click "OK". Now click on "Append New Label"!
In the New Label dialog box, key in the name
POS and the description "positive", and
then click "OK". To tag all visible observations with this newly
created label, click on the button "Add Label to Observations"!
You're back
in
the Observation Browser now and the label should
have been added to the observations. To test whether your action has
indeed had any effect, double click observation number 1
in the list.
In the bottom part of the Observation Editor you see that
this observation currently has two labels, namely
<query>:<1> and
COMPAR:POS, which is good news. Instead of
manually testing whether all other visible observations
were tagged too, you can let the filter mechanism do the testing.
Close the Observation Editor with "OK" and close
the Observation Browser with "Close"! Now set
the filter to
AND(<query>:<1>,COMPAR:POS)
so that you see only those observations that have both labels
and afterwards to either
NOT(OR(<query>:<1>,COMPAR:POS))
or
AND(NOT(<query>:<1>),NOT(COMPAR:POS))
so that you see only those observations that have neither one of
the labels. Each time verify what the
Workshop filter panel reports about number of
"observations in filter" and ask yourself whether the result is
what it should be!
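To check your expectations about these compound filters, it may help to model them outside the program. The Python sketch below treats a filter as a function over the set of labels of an observation; the helper names label, AND, OR and NOT merely mirror the notation used above and are not an interface of Abundantia Verborum. The four observations are made up.

```python
def label(name):
    """Leaf node: matched if the observation carries the label."""
    return lambda labels: name in labels

def AND(*nodes):
    return lambda labels: all(node(labels) for node in nodes)

def OR(*nodes):
    return lambda labels: any(node(labels) for node in nodes)

def NOT(node):
    return lambda labels: not node(labels)


# Each observation is represented by its set of labels only.
observations = [
    {"<query>:<1>", "COMPAR:POS"},
    {"<query>:<1>", "COMPAR:POS"},
    {"<query>:<2>"},
    set(),
]

q1, pos = label("<query>:<1>"), label("COMPAR:POS")
both = AND(q1, pos)
neither_a = NOT(OR(q1, pos))
neither_b = AND(NOT(q1), NOT(pos))

def count(node):
    """Number of observations matched by the filter ("observations in filter")."""
    return sum(1 for labels in observations if node(labels))

print(count(both), count(neither_a), count(neither_b))
```

The two formulations of "neither label" always give the same count, which is why the text offers them as alternatives.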
Note that when selecting a label in the Select Label
dialog box (and also in a few other dialog boxes, for that matter)
you first have to select, i.e. click on, the correct group before you can select
the label.
Now that you know the principle, set the filter to
<query>:<2> and tag all observations
matched by this filter with a newly created label
named COMPAR:COMP. The only stumbling block may
be creating the new label. You should know that in the
Add Label to Observations dialog box, completely analogous to what has been said in the
previous paragraph
about the Select Label dialog box, you first have to select
the correct group before
clicking on "Append New Label". New labels are appended to
the currently selected group.
Although no observations were matched by the third
query <query>:<3>, and therefore there
are no observations to add a
label COMPAR:SUP to, we nevertheless are going to
create this label, to make explicit that this too is a possible
value we would have assigned to observations if appropriate, be it
that this turned out never to be the case in the workshop at
hand. Up to this point we have been
creating labels
on the fly, when we needed them, right before we actually used them.
It is also possible to create a
label now and use it later, or, as in our example,
create it now and never use it at all. In order to
create COMPAR:SUP, close all dialog boxes,
if any are open, and choose "Workshop | Browse Labels..." (or
click on the first of the two little bottles in the speedbar)! This
brings you to the Label Browser, the environment with the fullest
access to the pool of labels in the current workshop.
From the Add Label to Observations dialog box, or from
the similar Set Observation's Labels dialog box, which will be
described in 3.2.5 Manually adding labels,
you have some limited control over your pool of labels: e.g. you can
change their name and description, and you can create new
labels.
In the Label Browser you have full control over your labels.
Apart from creating and editing, you can also delete
labels or create implication relations between labels.
Implication relations are discussed in
3.2.6 Label taxonomies. Full control
over the pool of labels is not given in the two dialog boxes just mentioned
because they are accessed from within the Observation Browser.
Having full control over the labels when the Observation Browser is open
would imply that you, the user, could
easily mess up the current environmental settings of the Observation
Browser. You could for instance delete a label that is used
in the current filter, thereby making this filter "un-wellformed" and making
the current contents of the Observation Browser "undefined".
This is why the program only grants you full access to your pool of labels
if the Observation Browser is closed.
To summarize what we did so far, we used label based filters,
in combination with the "Tag all" command of the Observation Browser,
to achieve what was the topic of the end of
3.2.2 Adding labels via queries,
namely "Using the right names". Back then the information
'adjective in positive form', 'adjective in comparative form' and
'adjective in superlative form' was implicitly present in
the labels
<query>:<1>,
<query>:<2> and
<query>:<3> respectively.
Now we have made this information explicit by creating the
labels
COMPAR:POS,
COMPAR:COMP and
COMPAR:SUP and assigning them to the
appropriate observations.
The technique we have used, the technique of adding labels via filters,
is a general mechanism for making implicit
information explicit. Up to this point filters have been
simple, but this need not be the case. For instance, we could have
used four queries to collect the data, splitting
up the search for the superlative cases into WORD(oldest)
as <query>:<3> and WORD(eldest)
as <query>:<4>. In that case, if
our corpus had been a bit bigger than it is and
if some matches to these queries
had turned up, we probably would
have added the label COMPAR:SUP via
the compound filter
OR(<query>:<3>,<query>:<4>).
In practice filters will often be much more complex.
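The technique of adding labels via filters amounts to two steps: decide which observations are matched, then add the new label to each of them. A minimal Python sketch of this idea follows, using the compound OR filter mentioned above. The function tag_all is a hypothetical stand-in for the "Tag All" command, and the observations are invented, since in our workshop no observation carries these query labels.

```python
def tag_all(observations, matches, new_label):
    """Add new_label to every observation matched by the filter.

    Returns the number of observations that were tagged.
    """
    tagged = 0
    for labels in observations:
        if matches(labels):
            labels.add(new_label)
            tagged += 1
    return tagged


# Each observation is represented by its set of labels (made-up data).
observations = [
    {"<query>:<1>"},
    {"<query>:<3>"},
    {"<query>:<4>"},
    {"<query>:<2>"},
]

# OR(<query>:<3>,<query>:<4>)
def sup_filter(labels):
    return "<query>:<3>" in labels or "<query>:<4>" in labels

n_tagged = tag_all(observations, sup_filter, "COMPAR:SUP")
print(n_tagged)
print(observations)
```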
We now know how to assign a new label to all observations that have certain labels and/or lack certain others. In the remainder of this section we see a second type of condition a filter can be based on: queries. Using a query based filter means filtering observations on the basis of features of their "Contents" or "Origin" field, instead of on the basis of which labels they have.
Running queries on huge corpora can be very time-consuming. Therefore the technique of collecting data with a whole series of queries, and then adding other labels on the basis of the distribution of the query labels, is not always the best approach if the corpora are large. An alternative is to be rough at the moment of data collection and to refine afterwards. First you collect the data with one or at most a few queries. These queries may over-generate; that is, they may accept data we are not interested in. Or they may fail to make distinctions we do want to make later on. Throwing away the spurious hits and/or making further distinctions is then done from within the workshop, with a technique you could roughly describe as 'running queries on workshops'.
Let us look at yet another approach for collecting data for
studying the word "old". Close all windows on the Abundantia Verborum
desktop and create
a new workshop! Choose "Workshop | Run Query..."! Create
a query WORD(RE([eo]ld.*)), save it as "over_gen.que"
and select it as the query to be run. We already know that the
special symbol sequence ".*" in a regular expression stands for:
anything (cf. 3.1.4 Queries with Wildcards).
New is that square
brackets in a regular expression also are special symbols. They list
alternative characters.
The complete query can be paraphrased as "any word with 'e' or 'o' in
the first position, 'l' in the second position, 'd' in the
third position and anything or nothing at all after that".
Clearly this query runs the risk of over-generating.
The words "eldership" and "oldestablished" are only two of the
examples that would be matched and that fall outside the scope of our study. Run the query
on "democorp.vic", in 'keep all' mode, and save the workshop
as "over_gen.wrk"! Choose "Workshop | Sort Observations | By Contents
or Origin..."! In the dialog box that pops up, make sure the
"Contents" field is checked and after "First occurrence of..."
key in "<MATCH>" (in uppercase)! Click "OK"! You just sorted
the observations in the current workshop
by their "Contents" field, with the following
extra specification: if the "Contents" field of an observation
contains the string "<MATCH>", the part of the "Contents" field that
comes before the first occurrence of this string is disregarded by the sort algorithm. If
the "Contents" field of an observation does not contain the string "<MATCH>",
the complete
"Contents" field is taken into account. The net result of your
sort instruction is that the observations are
sorted by hit.
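The sort rule just described can be imitated in a few lines of Python. This is a sketch under assumptions: the function name sort_key is made up, the example "Contents" fields are invented, and plain character-code ordering is used, which may differ from the collation the program applies.

```python
def sort_key(contents, marker="<MATCH>"):
    """Disregard everything before the first occurrence of the marker.

    If the marker does not occur, the complete field is the sort key.
    """
    pos = contents.find(marker)
    return contents if pos == -1 else contents[pos:]


contents_fields = [
    "she was <MATCH>older</MATCH> than him",
    "an <MATCH>elderly</MATCH> man",
    "the <MATCH>old</MATCH> house",
    "his <MATCH>elder</MATCH> brother",
]

by_hit = sorted(contents_fields, key=sort_key)
for line in by_hit:
    print(line)
```

Because the text before the hit no longer influences the order, observations with the same hit end up next to each other.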
If you open the Observation Browser now, you have a clear overview of the different types of hits. Press the <Page Down> key a few times to go through the observations. When you've reached the end of the list, press the <Home> key to return to the top of the list. You see the program has found 40 hits, namely 2 occurrences of "elder", 8 occurrences of "elderly", 23 occurrences of "old" and finally 7 occurrences of "older". Hits we did not have in former queries and do have now are "elder" and "elderly". We leave open whether they should be included in the study of "old". But they certainly illustrate that broad, possibly over-generating queries have the benefit of occasionally revealing data that should be included in one's study and that one may not have thought of in advance.
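The over-generating pattern used for this data collection can also be tried out on a word list. The sketch below uses Python's re module, on the assumption that the square brackets and ".*" behave there as they do in Abundantia Verborum regular expressions; fullmatch is used to imitate the whole-word matching of the WORD operator. The word list is made up.

```python
import re

# 'e' or 'o', then 'l', then 'd', then anything or nothing at all.
pattern = re.compile(r"[eo]ld.*")

words = ["old", "older", "elder", "elderly", "eldership",
         "oldestablished", "gold", "hold", "eld"]

# fullmatch requires the whole word to fit the pattern, so "gold" and
# "hold" are rejected even though they contain "old".
matched = [w for w in words if pattern.fullmatch(w)]
print(matched)
```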
After collecting all our data in one rough step, we can now
classify the hits with query based filters.
First open the Label Browser and create the label group
COMPAR with the labels POS,
COMP and SUP (we leave the description fields
to your inspiration)! Close the Label Browser!
Next open the Set Workshop Filter dialog box again,
but this time use yet another method: click on the
"Workshop filter" panel of the workshop window! There we
are! Choose "create Query node"! In the Add Query Node dialog
box click "Edit"! You end up in the Query Editor,
the same one we used before for constructing the queries we've been
running on corpora.
In the Query Editor choose "create RE node"! In the RE Editor key in "er</MATCH>"
and click "OK"! Save the query as "hit_comp.que"! Recall from section
3.1.4 Queries with wildcards
that regular expressions can be used outside of WORD-operators,
and that the effect is that the search engine disregards word
boundaries when looking for a match. We do not use a WORD-operator here
because we want the program to look for tags, not words:
we want it to look for the tag </MATCH>,
more precisely, the tag </MATCH> immediately preceded by "er".
The idea behind the query "hit_comp.que" is that in the workshop
"over_gen.wrk", in which the query "over_gen.que" was used for data
collection, "er" at the end of a hit is a good indicator for
comparative use of the adjective "old".
Note that the regular expression in this query does not
contain any special symbols. Regular expressions without
special symbols simply function as conventional find instructions.
Having saved the query, choose "OK" in the Query Editor to select
the query as current query in the Add Query Node dialog box.
Back in the Add Query Node dialog box you see that "er</MATCH>"
appears in the top panel and that underneath this panel, to the
right of the three buttons, "c:\abundant\user\hit_comp.que" is
displayed as name of the query.
Last but not least, make sure that "Contents" is checked in
the bottom panel, since we want to formulate a condition for
the "Contents" field of observations, not the "Origin" field.
If that is OK, click "OK" to create the query node and add it to
the filter that is under construction!
Back in the Set Workshop Filter dialog box the new query node looks like CONTENTS(c:\abundant\user\hit_comp.que).
This node represents a condition which is met by an observation
if the query engine finds at least one hit for the query at hand in the
"Contents" field of the observation, and which is not met otherwise.
A technical detail: when a query is run on a "Contents" or "Origin" field of
an observation, the complete field is taken to be
one constituent. Select
CONTENTS(c:\abundant\user\hit_comp.que) as new current
filter by clicking "OK" in the Set Workshop Filter dialog
box. Verify in the "Workshop filter" panel that the filter
has taken effect, by looking at the "observations in filter" line!
There should be 9 observations in the filter. These are
of course the 2 "elder" and the 7 "older" cases.
Now open the Observation Browser! You
see that only the "elder" and "older" cases are present.
Click "Tag All", select the label COMPAR:COMP,
and assign it to all visible observations!
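What the query node does can be sketched as follows. The Python fragment is an illustration only: contents_node is a made-up helper standing in for a CONTENTS(...) node, the observations are invented, and Python's re module is assumed to treat "er</MATCH>" as a plain string to find, just as described above.

```python
import re

observations = [
    {"contents": "his <MATCH>elder</MATCH> brother", "labels": set()},
    {"contents": "an <MATCH>elderly</MATCH> lady, older than most", "labels": set()},
    {"contents": "the <MATCH>old</MATCH> farmer", "labels": set()},
    {"contents": "an <MATCH>older</MATCH> version", "labels": set()},
]

def contents_node(pattern):
    """Condition met if the pattern is found at least once in the Contents field."""
    rx = re.compile(pattern)
    return lambda obs: rx.search(obs["contents"]) is not None

# "er" immediately followed by the closing tag, i.e. a hit ending in "er".
is_comparative = contents_node(r"er</MATCH>")

# Equivalent of "Tag All" with the query based filter set.
for obs in observations:
    if is_comparative(obs):
        obs["labels"].add("COMPAR:COMP")

tagged = [obs["contents"] for obs in observations if "COMPAR:COMP" in obs["labels"]]
print(tagged)
```

Note that the second observation is not tagged: the word "older" occurs in it, but not as the hit, so it is not followed by the closing tag.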
Since you have acquired an overview of your observations by
sorting them and then going over their list
in the Observation Browser, you know that
there are no superlatives present. Therefore you can now set the filter
to NOT(COMPAR:COMP) and from the Observation
Browser tag all visible observations with
the label COMPAR:POS.
Even if there are no candidates for being assigned the label
COMPAR:SUP, we nevertheless keep it in our pool of
labels as a possible value of COMPAR.
At every point we must make the effort
of making information explicit, of, so to speak, repeating slowly what we
ourselves see immediately, so that finally the program
understands it too. The reason for this rigour is simple.
At a later stage in the study, when there are too many
parameters for a human to keep track of at once, the only information
the program will take into account in its calculations is the explicit
information.
To conclude our discussion of starting with over-generating raw queries and
then compensating with refining query based filters, let us suppose that we decide not
to include the occurrences of "elderly" in our study of "old". In other words,
we decide
that the "elderly" cases should be thrown out as spurious hits.
This too we do with a query based filter.
Click on the "Workshop filter" panel to open the Set Workshop Filter
dialog box! If, as probably is the case, the current filter is not empty, first clear it by clicking
on the root node and choosing "delete node" or "delete subtree"!
Next choose "create Query node"! In the Add Query Node dialog box
clear the top panel by clicking on "Clear"! Next click on "Edit"!
In the Query Editor create the query "RE(ly</MATCH>)" and
save it as "hit_ly.que"! Leave the Query Editor by clicking "OK"!
Back in the Add Query Node dialog box make sure "Contents" is
checked and click "OK"! Finally, in the Set Workshop Filter dialog
box accept CONTENTS(c:\abundant\user\hit_ly.que) as
new current filter!
Now only the matches that end with "ly" are visible. Open the Observation Browser and click on "Clear All"! This function permanently destroys all observations that are currently visible. Since discovering that one has just thrown out the wrong observations is highly regrettable, the program tries to protect you from it and asks you to confirm that you want to go through with this (as the program puts it) 'highly destructive' operation. Click "Yes"! Close the Observation Browser, clear the filter and, last but not least, save the workshop!
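The effect of this last step on the numbers can be checked with a short Python sketch. The "Contents" fields below are placeholders that merely reproduce the hit counts reported earlier in this section; contents_filter and clear_all are made-up names, and Python's re module is assumed to behave like the program's regular expressions for this simple pattern.

```python
import re

# Placeholder "Contents" fields reproducing the hit counts of this section.
counts = {"elder": 2, "elderly": 8, "old": 23, "older": 7}
observations = [f"... <MATCH>{word}</MATCH> ..."
                for word, n in counts.items() for _ in range(n)]

def contents_filter(pattern):
    """Query based filter: matched if the pattern occurs in the Contents field."""
    rx = re.compile(pattern)
    return lambda contents: rx.search(contents) is not None

def clear_all(observations, matches):
    """Return the observations that survive when all matched ones are destroyed."""
    return [o for o in observations if not matches(o)]

ends_in_ly = contents_filter(r"ly</MATCH>")
visible = [o for o in observations if ends_in_ly(o)]
remaining = clear_all(observations, ends_in_ly)

print(len(observations), len(visible), len(remaining))  # 40 8 32
```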
We conclude this section by mentioning that the two bases for filters are not mutually exclusive. They can be combined in compound filters. Moreover complex filters can be constructed by combining stored queries.