Corpus annotation

UCREL has expertise in applying the following kinds of annotation to corpora:

Part-of-speech (POS) tagging

Part-of-speech (POS) tagging, also called grammatical tagging, is the commonest form of corpus annotation, and was the first form of annotation to be developed at Lancaster. Our POS tagging software, CLAWS (the Constituent Likelihood Automatic Word-tagging System), has been continuously developed since the early 1980s.

Although the structure of CLAWS has seen some changes since the first version was produced, it still consists of three stages: pre-edit, automatic tag assignment, and manual post-edit. In the pre-edit stage the machine-readable text is automatically converted into a suitable format for the tagging program.

The text is then passed to the tagging program, which assigns a part-of-speech tag to each word or word combination in the text. To assist it in this task, CLAWS has a lexicon of words with their possible parts of speech, and a further list of multi-word syntactic idioms (e.g. the subordinator in that). These databases are constantly updated as new texts are analyzed. To deal with words which are not in these databases, CLAWS uses various heuristics, including a suffix list of common word endings with their possible parts of speech.

Because one orthographic form may have several possible parts of speech (e.g. love can be a verb or a noun), after the initial assignment of possible parts of speech CLAWS uses a probability matrix, derived from large bodies of tagged and manually corrected text, to disambiguate the words in the text. The matrix specifies transition probabilities between adjacent tags: for example, given that x is an adjective, what is the probability that the item to its immediate right is a noun? Again, these probabilities are constantly updated from new data. CLAWS tracks through each sentence in turn, applying these probabilities.

Finally, manual post-editing with a special tag editor may take place, if desired, to correct the machine output fully. The CLAWS system achieves a success rate in the region of 96-97% on written texts, and is also successful, though to a slightly lesser degree, on spoken texts.
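The disambiguation step can be illustrated with a toy version of the probabilistic search. The lexicon and transition probabilities below are invented for illustration and are not CLAWS data:

```python
# Toy CLAWS-style disambiguation: each word gets its candidate tags from a
# lexicon, then a Viterbi search over tag-transition probabilities picks the
# most likely tag sequence.

LEXICON = {
    "the": ["AT"],
    "old": ["JJ", "NN"],
    "man": ["NN", "VB"],
}

# P(next tag | current tag); "<s>" marks the sentence start.
TRANS = {
    ("<s>", "AT"): 0.5,
    ("AT", "JJ"): 0.4, ("AT", "NN"): 0.5,
    ("JJ", "NN"): 0.6, ("JJ", "VB"): 0.1,
    ("NN", "VB"): 0.3,
}

def disambiguate(words, floor=1e-4):
    """Return the most probable tag sequence for `words`."""
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {"<s>": (1.0, [])}
    for w in words:
        nxt = {}
        # Unknown words would go to CLAWS's suffix heuristics; here they
        # simply get a placeholder candidate.
        for tag in LEXICON.get(w, ["??"]):
            for prev, (p, path) in best.items():
                score = p * TRANS.get((prev, tag), floor)
                if tag not in nxt or score > nxt[tag][0]:
                    nxt[tag] = (score, path + [tag])
        best = nxt
    return max(best.values())[1]

print(disambiguate(["the", "old", "man"]))  # → ['AT', 'JJ', 'NN']
```

The real system works in the same spirit, but with a far larger lexicon, the idiom list, and transition probabilities estimated from tagged corpora.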

Several tagsets have been used with CLAWS over the years. The CLAWS1 tagset has 132 basic word tags, many of them identical in form and application to Brown Corpus tags. A revision of CLAWS at Lancaster in 1983-86 resulted in a new, much revised tagset of 166 word tags, known as the `CLAWS2 tagset'. The tagset for the British National Corpus (the C5 tagset) has just over 60 tags. This tagset was kept small because it was designed for handling much larger quantities of data than had been dealt with up to that point (see Leech, Garside and Bryant, 1994). For the BNC sampler corpus the enriched C6 tagset, which has over 160 tags, was used. The following is an example of CLAWS analysis using the CLAWS1 tagset:


hospitality_NN is_BEZ an_AT excellent_JJ virtue_NN ,_, but_CC
not_XNOT when_WRB the_ATI guests_NNS have_HV to_TO sleep_VB
in_IN rows_NNS in_IN the_ATI cellar_NN !_! 

the_ATI lovers_NNS ,_, whose_WP$ chief_JJB scene_NN was_BEDZ
cut_VBN at_IN the_ATI last_AP moment_NN ,_, had_HVD
comparatively_RB little_AP to_TO sing_VB 

'_' he_PP3A stole_VBD my_PP$ wallet_NN !_! '_' roared_VBD 
Rollinson_NP ._.
Key to tags

[Note that in this scheme, as in others for written text, punctuation marks count as `words', and are themselves given tags.]

For more information on the CLAWS tagger, see Garside (1987) and Leech, Garside and Bryant (1994) (html version).

Grammatical parsing

Part-of-speech tagging is often seen as the first stage of a more comprehensive syntactic annotation, which assigns a phrase marker, or labelled bracketing, to each sentence of the corpus, in the manner of a phrase structure grammar. The resulting parsed corpora are known, for obvious reasons, as `treebanks'.

Grammatical annotation at Lancaster originated with Geoffrey Sampson's manual parsing of the Lancaster-Leeds Treebank. This scheme used a detailed system of labelled brackets, which distinguished between, for example, singular and plural noun phrases (Sampson, 1987a). The second phase was the scheme adopted for the Lancaster Parsed Corpus (initially parsed by the probabilistic parser developed in 1983-86 and subsequently corrected manually), which used a reduced set of constituents (Garside, Leech and Váradi, 1992). Currently, UCREL employs a technique known as skeleton parsing. This simplified grammatical analysis uses an even smaller set of grammatical categories. Texts are parsed by hand using a program called EPICS written by Roger Garside. EPICS speeds up the manual parsing process by storing the set of constituents which are open at a particular point in the text; the human operator then closes these constituents or opens additional ones at appropriate points. EPICS aims at parsing with a minimum of key presses: at `full stretch' operators can parse sentences averaging more than 20 words in length at a rate of less than a minute per sentence (Leech and Garside, 1991).
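The bookkeeping that EPICS automates can be sketched as a stack of currently open constituents. This is a hypothetical illustration of the idea, not Garside's program:

```python
# Toy sketch of EPICS-style bookkeeping: the annotator opens and closes
# constituents with single commands, and the program tracks which
# constituents are currently open so each close needs only one key press.

class SkeletonParser:
    def __init__(self):
        self.open_stack = []  # labels of constituents currently open
        self.output = []      # emitted tokens and brackets

    def open(self, label):
        self.open_stack.append(label)
        self.output.append("[" + label)

    def word(self, token):
        self.output.append(token)

    def close(self):
        # Closes the most recently opened constituent.
        label = self.open_stack.pop()
        self.output.append(label + "]")
        return label

p = SkeletonParser()
p.open("S")
p.open("N"); p.word("Nemo_NP1"); p.close()
p.open("V"); p.word("has_VHZ"); p.word("arrived_VVN"); p.close()
p.close()
print(" ".join(p.output))
# → [S [N Nemo_NP1 N] [V has_VHZ arrived_VVN V] S]
```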


[S[N Nemo_NP1 ,_, [N the_AT killer_NN1 whale_NN1 N] ,_, [Fr[N who_PNQS N][V 'd_VHD grown_VVN [J too_RG big_JJ [P for_IF [N his_APP$ pool_NN1 [P on_II [N Clacton_NP1 Pier_NNL1 N]P]N]P]J]V]Fr]N] ,_, [V has_VHZ arrived_VVN safely_RR [P at_II [N his_APP$ new_JJ home_NN1 [P in_II [N Windsor_NP1 [ safari_NN1 park_NNL1 ]N]P]N]P]V] ._. S]

Key to Parsing Symbols. Square brackets enclose constituents above word level.

Wordtags are linked to their words by `_'. The tagset used here is the revised version of the earlier CLAWS tagset, known as the `CLAWS2 tagset'.

Unlabelled brackets indicate a constituent for which no label is provided by the annotation scheme. (Annotators are expressly allowed to identify constituents without having to identify the category to which they belong.)

Semantic tagging

Beyond grammatical annotation, semantic annotation is an obvious next step. For example, semantic word-tagging can be designed with the limited (though ambitious enough) goal of distinguishing the lexicographic senses of the same word: a procedure also known as `sense resolution'.

The ACASD semantic tagging system (Wilson and Rayson, 1993) accepts as input text which has been tagged for part of speech using the CLAWS POS tagging system. The tagged text is fed into the main semantic analysis program (SEMTAG), which assigns semantic tags representing the general sense field of words from a lexicon of single words and an idiom list of multi-word combinations (e.g. as a rule), both of which are updated as new texts are analyzed. (Items not contained in the lexicon or idiom list are assigned a special tag, Z99, to assist in updating and manual post-editing.)

The tags for each entry in the lexicon and idiom list are arranged in general rank frequency order for the language. The text is manually pre-scanned to determine which semantic domains are dominant; the codes for these major domains are entered into a file called the `disam' file and are promoted to maximum frequency in the tag lists of each word in which they appear. This combination of general frequency data and promotion by domain, together with heuristics for identifying auxiliary verbs, considerably reduces the mistagging of ambiguous words. (Further work will attempt to develop more sophisticated probabilistic methods for disambiguation.)

After automatic tag assignment has been carried out, manual post-editing takes place, if desired, to ensure that each word and idiom carries the correct semantic classification (SEMEDIT). A program (MATRIX) then marks key lexical relations (e.g. negation, modifier + adjective, and adjective + noun combinations). The following is an example of semantic word-tagging, taken from the automatic content analysis project at Lancaster:
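The rank-frequency lists and domain promotion can be sketched as follows. Apart from O4.3 (shade) and B4 (lipstick), which appear in the example in this section, the candidate tags and domain codes here are hypothetical:

```python
# Toy sketch of ACASD tag selection: candidate semantic tags are stored in
# general rank-frequency order, and tags from the dominant domains listed
# in the `disam' file are promoted to the front before the first tag is taken.

LEXICON = {
    "shade": ["O4.3", "H5"],  # most frequent sense of `shade' ranked first
    "lipstick": ["B4"],
}

def promote(tags, disam_domains):
    """Move tags belonging to a dominant domain to the front of the list."""
    promoted = [t for t in tags if any(t.startswith(d) for d in disam_domains)]
    rest = [t for t in tags if t not in promoted]
    return promoted + rest

def tag_word(word, disam_domains=()):
    tags = LEXICON.get(word, ["Z99"])  # Z99: not in lexicon or idiom list
    return promote(tags, disam_domains)[0]

print(tag_word("shade"))          # → O4.3 (general rank-frequency order)
print(tag_word("shade", ("H",)))  # → H5 (domain H promoted via `disam')
```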


PPIS1     I            Z8
VV0       like         E2+
AT1       a            Z5
JJ        particular   A4.2+
NN1       shade        O4.3
IO        of           Z5
NN1       lipstick     B4

In this fragment, the text is read downwards, with the grammatical tags on the left and the semantic tags on the right. The semantic tags are composed of:

     an upper-case letter indicating the general category;
     a digit indicating a subcategory;
     optionally, a decimal point and a further digit indicating a sub-subcategory;
     optionally, a `+' or `-' marker indicating which pole of an antonymous pair applies.

For example, A4.2+ indicates a word in the category `general and abstract words' (A), the subcategory `classification' (A4), the sub-subcategory `particular and general' (A4.2), and `particular' as opposed to `general' (A4.2+). Likewise, E2+ belongs to the category `emotional states, actions, events and processes' (E), subcategory `liking and disliking' (E2), and refers to `liking' rather than `disliking' (E2+).
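The decomposition just described can be expressed as a small parser (an illustrative sketch, not part of the UCREL software):

```python
# Decode a semantic tag such as A4.2+ into its parts: category letter,
# subcategory digit, optional sub-subcategory digit, optional +/- marker.
import re

def decode(tag):
    m = re.fullmatch(r"([A-Z])(\d+)?(?:\.(\d+))?([+-])?", tag)
    category, sub, subsub, polarity = m.groups()
    return {"category": category, "subcategory": sub,
            "sub-subcategory": subsub, "polarity": polarity}

print(decode("A4.2+"))
# → {'category': 'A', 'subcategory': '4', 'sub-subcategory': '2', 'polarity': '+'}
```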

The semantic annotation is designed to apply to open-class or `content' words. Words belonging to closed classes, as well as proper nouns, are marked by a tag with an initial Z, and set aside from the statistical analysis.

For more information, see the USAS web page, Thomas and Wilson (1996), and Garside and Rayson (1997).

Anaphoric annotation

The UCREL anaphoric annotation scheme co-indexes pronouns and noun phrases within the broad framework of cohesion described by Halliday and Hasan (1976). The software used for this annotation (XANADU) is an X-windows interactive editor written by Roger Garside. It allows the user to move around a block of text, displaying around 20 lines at a time. The user can use a mouse to highlight any segment of the text which s/he wishes to annotate. A further window displays a set of command keys, mainly listing the different types of anaphora. Clicking on one of the buttons sets up the insertion routine for that anaphor type. Another window lists the items which have previously been highlighted, and at some point in the insertion routine the annotator is required to click on the appropriate one as the antecedent of the anaphor (Fligelstone, 1992).

An example of this form of annotation is as follows:


S.1  (0) The state Supreme Court has refused to release
{1 [2 Rahway State Prison 2] inmate 1}} (1 James Scott 1) on
bail .
S.2 (1 The fighter 1) is serving 30-40 years for a 1975 armed
robbery conviction .
S.3 (1 Scott 1) had asked for freedom while <1 he waits for an
appeal decision .
S.4 Meanwhile , [3 <1 his promoter 3] , {{3 Murad Muhammed 3} ,
said Wednesday <3 he netted only $15,250 for (4 [1 Scott 1] 's
nationally televised light heavyweight fight against {5 ranking
contender 5}} (5 Yaqui Lopez 5) last Saturday 4) .
S.5 (4 The fight , in which [1 Scott 1] won a unanimous
decision over (5 Lopez 5) 4) , grossed $135,000 for [6
[3 Muhammed 3] 's firm 6], {{6 Triangle Productions of
Newark 6} , <3 he said .

Key: The use of the same index 1, 2, ... n binds one syntactic constituent to another to which it is coreferential or semantically equivalent. In the following list, i represents an arbitrary index:

(i   i) or
[i   i]   enclose a constituent (normally a noun phrase) entering into
          an equivalence `chain'
<i        indicates a pronoun with a preceding antecedent
>i        indicates a pronoun with a following antecedent
{{i   i}  encloses a noun phrase entering into a copular relationship
          with a preceding noun phrase
{i   i}}  encloses a noun phrase entering into a copular relationship
          with a following noun phrase
(0)       represents an anaphoric barrier, in effect the beginning of a new text.
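One consequence of the indexing is that coreference chains can be read off mechanically. The following toy extractor (not XANADU, and deliberately simplified: it ignores nesting and treats all bracket types alike) collects the spans sharing an index:

```python
# Toy extraction of coreference chains from the index notation: gather,
# for each index i, the bracketed spans and pronouns marked with i.
import re
from collections import defaultdict

def chains(annotated):
    spans = defaultdict(list)
    # Constituents such as (1 James Scott 1), [1 Scott 1] or {1 ... 1}
    for i, text in re.findall(r"[([{]+(\d+) (.+?) \1[)\]}]+", annotated):
        spans[int(i)].append(text)
    # Pronouns with a preceding antecedent, written e.g. <1 he
    for i, pron in re.findall(r"<(\d+) (\w+)", annotated):
        spans[int(i)].append(pron)
    return dict(spans)

print(chains("(1 James Scott 1) ... (1 The fighter 1) ... <1 he waits"))
# → {1: ['James Scott', 'The fighter', 'he']}
```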

Such annotations have potential use for studying and testing mechanisms like pronoun resolution, which are important for text understanding and machine translation.

Prosodic annotation

Prosodic annotation aims to indicate patterns of intonation, stress and pauses in speech. It is a much more difficult type of annotation to achieve than the types discussed above: it cannot be done automatically and requires careful listening by a trained ear. The Lancaster/IBM Spoken English Corpus is the only prosodically transcribed corpus in which Lancaster has been involved to date. The prosodic annotation of the SEC was carried out by two phoneticians (Gerry Knowles and Briony Williams). A set of 14 special characters was used to represent prosodic features. Stressed syllables were marked with a symbol indicating the direction of the pitch movement. Syllables which were felt to be stressed but with no independent pitch movement were marked with a circle (or bullet in the printed version). Unstressed syllables, whose pitch is predictable from the tone marks of surrounding accented syllables, were left unmarked.

Prosodic annotation is considerably more impressionistic than annotation at other linguistic levels. To check the consistency of the transcriptions, some sections (approximately 9% of the corpus) were therefore transcribed independently by both transcribers. The two versions differ considerably (cf. Wilson, 1989; Knowles, 1991), but the resulting correlations could have important implications for future research (cf. Wichmann, 1991).