The corpus and the citation archive - peaceful coexistence between the best and the good?

Christian-Emil Ore
University of Oslo
The Documentation Project, P.O. Box 1123, Blindern,
N-0317 Oslo, 
Norway

Abstract: This paper focuses on how to preserve the information in a traditional citation archive for the computerized world. The sample archive is a citation archive at the Department of Lexicography at the University of Oslo, Norway. Originally the archive was to be keyed in and given an SGML markup. For reasons discussed in the paper this approach was abandoned. The archive is now being converted into an indexed facsimile database, which seems to be a better solution both from a scientific and from an economic point of view. The research, the program development and the actual conversion of the information in the archive form a subproject of the Documentation Project, which is a collaboration between the four Norwegian universities.

1 Introduction

The citation collection in the form of a card archive used to be the standard way in all lexicographic departments of systematically storing examples of word usage and additional information about words. The introduction of computers has brought completely new tools to the lexicographers. The computer-based lexicographer uses a text corpus and associated computer tools for making different kinds of concordances, for tagging the running words with word class markers, and program packages for statistical analysis of the texts. However, worldwide there are still quite a lot of dictionary projects relying on traditional citation archives and not on text corpora. Among these we find large long-term national projects, but also newer ones, such as a recently started project for a national dictionary of Latvian. For a well-established project it may not be regarded as possible to switch from the cards to the electronic, corpus-based methods because of the costs in time and money. Thus the introduction of computers has created a big difference in the methods of obtaining information between those using corpora and those using the traditional card archives.

It should be added that there is also a difference of opinion on the methodological level. For many, a citation collection is an unsystematically collected set of information, or a partial concordance, and an example of the technical level of the past. Others consider their citation collection a chest of treasures, where each citation has been selected with great care, whereas in a corpus it can be difficult to include texts with the special or rare words of the language in question. The opinion of the author of this paper is that the corpus method is the most rational and best way of collecting language information. However, in this paper I will focus on how to preserve the information in a traditional citation archive for the computerized world. The sample archive is a citation archive at the Department of Lexicography at the University of Oslo, Norway.

The research, the program development and the actual conversion of the information in the archive form a subproject of the Documentation Project, which is a collaboration between the four Norwegian universities. The objective is to make databases out of the paper-based archives of the so-called collection departments at the universities in Norway, ranging from the Viking Ship Museum and the archaeological museums via folk music and place names to the lexicographic departments. The actual conversion is mostly done by unemployed persons engaged in special training schemes based on a 50-50 work and education program, but also by unemployed people temporarily engaged in full-time work (financed by the government). The work is done in specially designed education and conversion centres, but also in smaller groups. For the time being 220 persons are engaged. The use of unemployed persons as a workforce introduces some complications which might have been avoided with ordinary staff. An introduction to this project was given at the ALLC-ACH 1993 conference at Georgetown University, and a more detailed description can be found in volume 28, no. 6 of "Computers and the Humanities".

2 The task and the first project plan

2.1 The task

The archive at the Department of Lexicography at the University of Oslo contains 3.2 million paper cards collected by some 600 persons during the last 60 years. On an international scale this is a small or medium-sized citation archive. Even so, the archive is so large that any kind of manual inspection of all the cards would take at least a full man-year. The archive is used by the editors of Norsk Ordbok, the national dictionary of the Norwegian dialects and the written standard variety Nynorsk. For the time being the editors are working on the beginning of the letter 'H'. The history and the size of the archive imply that nobody knows its contents precisely. The archive is heterogeneous and contains oral citations with or without more or less standard phonetic transcription, additional remarks about special usage of the word in question, about its meaning and about where the citation was heard, as well as citations from all kinds of printed matter with or without additional comments by "the author" of the actual paper card. The cards are handwritten, typed or both, and may have pieces of newspapers glued onto them. The objective of the subproject is to create a computer-based system which as a minimum contains all the information in the citation archive.

2.2 The project plan

The Department of Lexicography did not allow the original cards to be sent to the conversion centre located in Mo i Rana, a small, formerly industrial town near the Arctic Circle, a thousand kilometres north of Oslo. For this reason the cards had to be photocopied. This restriction gave us the opportunity to manually sort out the cards containing just a citation from a well-defined printed source such as novels, old dictionaries etc. These cards were to be replaced by electronic full-text versions of the dictionaries and word lists and by a corpus. The rest of the cards were to be keyed in.

The reason for removing the cards based on dictionaries and word lists should be obvious. The simple excerpt cards based on literary sources were removed because corpus-based concordances seem to be a better solution in both efficiency and quality. When most of the cards were made, the lexicographers used to read the books and mark words, and some other person then wrote out the actual card. Some of the lexicographers chose the first occurrence and never more than one from each book. Thus the piece of context on such a card need not be the most interesting or useful sample of the actual use of the (head)word. We soon discovered that these simple excerpt cards were based on too large a number of printed sources to make it feasible for a non-expert to sort out all such cards in a reasonable time. But with the help of a simplified list some student assistants sorted out approximately one third of the cards.

The conversion was to be done by simply keying in the cards using a word processor and then encoding the resulting text according to a mark-up scheme. It was also decided to start from the letter L and to make an evaluation after the letters L, M and N (300,000 cards) had been processed. A grammar, in fact an SGML DTD, was developed to support the structural encoding of the cards. Such a grammar enables automatic loading of the data into a relational or hierarchical database, but can also be used to support searches in a full-text retrieval system. The data modelling was done by a group consisting of three experienced project assistants and a programmer. Over three months the assistants analysed the contents of the cards and wrote a 70-page encoding manual. The assistants also trained the encoders in Mo i Rana. The conversion and encoding work started in January 1993.
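To give an impression of the grammar, a much simplified DTD fragment in the spirit of the one actually used might look as follows. The element and attribute names are those that appear in the encoded card in figure 3, but the content models and tag omission rules shown here are indicative simplifications made for this paper, not the real DTD, whose use is documented in the 70-page encoding manual.

<!-- Indicative sketch only: names follow figure 3, content models are simplified -->
<!ELEMENT NSET     - - (HEADWORD+, WORDFORM*, SOURCE) >
<!ATTLIST NSET     NO  NUMBER #REQUIRED >
<!ELEMENT HEADWORD - - (#PCDATA) >
<!ATTLIST HEADWORD GRM CDATA   #IMPLIED >
<!ELEMENT WORDFORM - - (#PCDATA | WORDFORM | COMMENT | INFLECTION |
                        DEFINITION | CIT)* >
<!ELEMENT CIT      - - (#PCDATA | WORDF | COMMENT | EXPL)* >
<!ATTLIST CIT      T   (USET | SENT) #REQUIRED >
<!ELEMENT WORDF    - - (#PCDATA) >
<!ATTLIST WORDF    GR  CDATA   #IMPLIED >
<!ELEMENT (COMMENT | INFLECTION | DEFINITION | EXPL | SOURCE)
                   - O (#PCDATA) >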

Figure 1: The structure of a database combining the slip archive and the tagged text corpus

2.3 Problems with the original plan

There will of course always be a trade-off between the speed of the data entry and the complexity of the tagging done by the encoders. During 1993 the project group found that the work was of acceptable quality but that progress was slow, approximately half the estimated speed. Several factors may explain the slow progress: unreadable photocopies, unintelligible handwriting, a too complicated encoding scheme and low motivation among the encoders. If the latter is correct, it is a result of the former and perhaps of circumstances outside the actual project. Many of the photocopies were of poor quality. The encoders might have produced a better result (and had fewer reasons for complaint) if the original cards had been used or if we had invested in a better photocopier. However, the costs of the pagination, sorting and copying of the cards were already quite high (8 cents per card, or $240,000 for the entire archive). The paper copies were only needed during the conversion process and were then sent for recycling or simply thrown away. This also created a push to find a more forest-friendly way of making the necessary copies of the cards.

Figure 2: Facsimile of a word slip

The encoding scheme was simplified during the first year. It was also suggested that the typists (encoders) should do even less encoding and leave most of it to the project assistants. Another suggestion was to make the encoding system extremely simple, but that does not seem to be the right thing to do (as I will explain later on). It would not be correct to claim that bad copies and a complicated encoding scheme were the sole reasons for the low productivity. As a small-scale pilot project a few thousand cards had been converted by two small groups of unemployed people, but the full-scale conversion was started as the practical part of a completely new type of training scheme for the unemployed. In fact we started two new projects simultaneously, which in hindsight cannot be called an ideal situation. Still, the impression is that it is possible to use people without any previous experience to do rather complicated analysis and encoding work, but not without intensive supervision. Our experience from other subprojects supports this, especially the court protocol project, in which a small group (7 persons) transcribes 17th-century protocols handwritten in Gothic script.

2.4 The new way to register the archive

The slow progress forced us either to stop the project immediately or to find a completely different way of creating a database of the citation archive. The idea for a solution was taken from the Norwegian banking system, where all giro forms are microfilmed and the transaction information is stored in a database together with a reference to the corresponding microfilm picture. In our case we have added electronic images. All 3.2 million citation cards have been scanned and microfilmed by Kodak Norway Inc. The electronic images have been stored on CD-ROMs, approximately 40,000 on each disk. The encoders in Mo i Rana now use a tailor-made registration application. For each citation card they type the headword, the word class marker and the source of the information on the card. In the resulting database it is possible to query the headwords, the word class markers, and the time of creation, type, place and title of the source of the citation. We could reorganize the registration in this way because it did not conflict with the original plan; rather, it implied a registration in several steps. After a more detailed analysis of the project and the archive it is no longer clear that a complete transcription of the cards is the optimal solution. I shall explain this in more detail in the second part of the paper.
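Conceptually, each scanned card thus gets a small index record pointing to its facsimile. Purely for illustration, such a record can be written in the same SGML style as figure 3; the element names, the illustration flag and the image reference below are invented for this sketch, since the application itself stores the fields directly in the database.

<!-- Illustration only: element names and the image reference are invented -->
<CARDREC>
<HEADWORD GRM=v>leta</HEADWORD>
<SOURCE>VossNLid NO</SOURCE>
<ILLUSTRATION>no</ILLUSTRATION>
<IMAGE DISK="CD-0123" FRAME="040117">
</CARDREC>

A query on headword or word class returns such index records, and the corresponding images are then fetched from the CD-ROMs for reading on the screen.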

2.5 Why not facsimiles from the beginning?

At an early stage of the Documentation Project (spring 1991) Per Kristian Halvorsen at Xerox PARC suggested that all the paper slips should be scanned. At that time the idea seemed unrealistic due to the scanning costs and the cost of online storage media (40 gigabytes), and it was definitely far beyond the funding of the Documentation Project. From the lexicographic point of view it also looked pointless to make expensive images of cards only to serve as background material for a (huge) dictionary, and it was clear that the lexicographers would want to be able to search for words, sentences and so on. Hence the only reasonable solution in 1992 was to photocopy the cards and to key in and tag their content. Today a modern photocopying machine is a scanner combined with a printer, and the situation is completely different. At the very least, the photocopying of archive material should be replaced by scanning and storing the images on electronic media.

3 What do we lose and what do we gain in the text/facsimile database?

3.1 Electronic text versus facsimile

Word processors (WordPerfect, Microsoft Word) store texts in a rather compact way, using little more than one byte per character. This is a simple way of storing texts which enables us to search for words and strings. If the sole purpose is to read a text stored on a computer, however, it is sufficient to store an image (facsimile) of the text. The reader does not need to care about, or know, how the displayed characters are stored in the computer. It must of course be taken into consideration that storing a text as an electronic image consumes much more space and computing resources than an ordinary electronic text (at a ratio of at least 100:1). However, just as it can in certain cases be justified to publish expensive facsimile editions of manuscripts, it can be justified to store electronic facsimiles in a database.

<NSET NO=60605>
<HEADWORD GRM=v>leta</HEADWORD>
<WORDFORM>lita<WORDFORM>léta</WORDFORM>
<COMMENT>tr</COMMENT><INFLECTION>a - a
<DEFINITION>farga, gjeva farga
<CIT T=USET>
<WORDF GR=inf>líta</WORDF>gad'n
<COMMENT>heril litargarn (lítagad'n) n. </COMMENT></CIT>
<CIT T=SENT>eg veit ikje kor da čr 
<WORDF GR=prp>líta</WORDF><EXPL>um eit dyr o.l.</EXPL>
</CIT></WORDFORM>
<SOURCE>VossNLid NO</SOURCE>

Figure 3: Encoded version of the word slip shown in figure 2

3.2 Text database or text/image database

The paper cards in question were made over a period of 60 years by people with or without academic training, and they are rather heterogeneous. The only common features seem to be a headword, a body and a piece of source information at the bottom. It has been claimed that the encoding of the cards should simply follow this three-part scheme, as it is easy to learn and easy to follow. According to this scheme the cards would have been accessible as searchable electronic text, and in addition the headword and the source could serve as keys in a database. Such a simple mark-up scheme did not appeal to the planning group at the start of the project: for many cards it is necessary to see the actual position of the text on the card in order to understand the meaning (or the bindings between the pieces of text).
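For comparison, this rejected minimal scheme would have reduced each card to a record of the following kind (the element names here are invented for illustration; contrast the much richer encoding of the same card in figure 3):

<!-- Sketch of the rejected minimal scheme; element names are invented -->
<CARD>
<HEADWORD>leta</HEADWORD>
<BODY>... the full text of the card, keyed in as one undifferentiated block ...</BODY>
<SOURCE>VossNLid NO</SOURCE>
</CARD>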

3.2.1 The cards and their content

It is not clear that it is worthwhile trying to make the content of many of the cards accessible as electronic text at all. Some arguments against transcribing all the cards are:

1) Although the encoding scheme covers all the different ways of writing a card, some cards are so complicated or unsystematically written that the encoder has to interpret the information on the card him- or herself in order to make a correct encoding.
2) There are several phonetic alphabets in use, as well as many inconsistent attempts at such alphabets.
3) The handwriting on many cards is hardly legible.
4) Many cards were put into the archive as "reminders": "There exists a word so and so and it can be found in some printed text".
5) 30% of the cards contain nothing more than an excerpt from word lists, dictionaries or frequently excerpted books.
6) 5% of the cards have illustrations.

Points 1-3 describe the technical quality of the archive. If for each card the headword, its word class, the source and perhaps the actual word form cited on the card are registered in a database together with an image of the card, it is possible to read the original card on the screen. The "reminder cards" contain exactly the information categories headword, word class and source. It is of little use to encode the (attempts at) phonetic transcription on the cards if the transcription is inconsistent and/or unclear as to which phon(em)e each character denotes; it is hard to think of meaningful queries on such material. Thus it seems a better solution to use the facsimiles to read the phonetic transcriptions, and perhaps to tag the cards containing such information for later purposes. The simple excerpt cards are registered in the same way as the more "interesting" cards. The excerpt cards will have more or less the same role as the "reminders", since the pieces of context on the card in many cases are not very interesting or useful. The reason for not sorting out these cards is purely practical: the sorting would take at least one year for a skilled person. It should be mentioned that it is possible to transcribe and tag all the cards (we have shown this). But from the discussion above one has to conclude that in many cases it will be necessary to be able to consult the original card. Hence it seems a better solution to register, for each card, the headword, its word class and the source, and to store this information together with a facsimile of the card in a database. This can of course be seen as the first step in the process towards a real text database, but, as I will try to show, the text/facsimile database is an interesting product in its own right.

3.2.2 What we gain

We lose the possibility of searching the comments and the actual citations. These deficiencies are compensated for by the collection of electronic texts which has already been entered by OCR as a replacement for the cards based on well-defined printed sources (see above).

When it was decided to go for the facsimile database, the project group started to systematize the source (or citation) lists. The list now consists of a little less than 3,000 printed sources (each newspaper and series of periodicals is counted as a single entry). In addition we have identified more than a thousand card writers (persons making the excerpt cards from marked books as well as persons making cards based on their own scientific or unscientific observations). The latter type of cards are called "original cards" in the internal jargon of the dictionary team.

As mentioned earlier, the encoders now use a tailor-made registration application. For each facsimile, that is, each archive card, they type the headword and word class. There may be several headwords, indicating variant spellings in normalized Norwegian. They also type the source as they find it on the card and tick a field if the card has an illustration. Then they try to match the source information against entries in the two lists mentioned above, that is, the list of printed sources and the list of card producers. Thus a card with a citation is linked to the identification number of the printed source, and a card carrying the name or number of the card writer is linked to the right person by his or her identification number in our list. As a consequence, all the search keys in both lists can be used in a search among the cards, that is, date of publication, kind of publication, author and/or translator, and, for the second list, age, profession, team member or plain card writer, and place in Norway for oral citations or dialect information. These extra search possibilities we would also have had in the original solution, although there we planned to use free-text searching techniques because of the extra time such linking demands.

What we do not get in the facsimile database is the possibility of searching the extra information on the so-called "original cards" mentioned earlier. Some of the excerpt cards and most of the "original cards" contain valuable comments about the use of the word, local variants etc. These cards thus contain information which cannot be obtained from the corpus or a concordance. Here the 150,000 cards which have already been encoded and proofread according to the original plan are of great help. These cards are tagged according to the SGML DTD already mentioned. This mark-up, the source lists and the list of card writers make it possible to find most of these interesting cards among the remaining 3 million cards in the facsimile database. Cards with citations from printed sources may contain extra comments, but rarely interesting ones; most of the extra information can be deduced from the author's place of living (list 2) or from the kind of literature in question (list 1). A few card producers almost always add interesting comments; their cards are identified from list 2. Cards without information about a printed source but with the card producer's name are likely to contain original dialect information. Among the card producers some, particularly teachers (profession, list 2), act as if they were experienced philologists and trace the word almost back to Sanskrit. The extra information on their cards is not very useful. Some tests indicate that at least 80% of the cards with relevant extra information can be picked out by using the two lists, asking the lexicographers and using the 150,000 fully tagged cards as a control.
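As a rough illustration of the linking, a registration record refers to entries in the two lists by their identification numbers, and the searchable keys live on those entries rather than on the card itself. In the same illustrative SGML style as above, with all identifiers, attribute names and values invented:

<!-- Illustration only: all identifiers, attribute names and values are invented -->
<PRINTEDSOURCE ID="S0211" TYPE="newspaper" YEAR="1936"> ... </PRINTEDSOURCE>
<CARDWRITER ID="W0087" PROFESSION="teacher" PLACE="Voss"> ... </CARDWRITER>

<CARDREC SOURCEREF="S0211" WRITERREF="W0087">
<HEADWORD GRM=v>leta</HEADWORD>
</CARDREC>

A search for, say, all cards written by teachers in a given district then goes through the card writer list and follows the links back to the index records and their facsimiles; a card may of course be linked to a printed source, a card writer or both.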

4 Conclusion

The combined facsimile and text database is in our opinion a better solution than the originally planned database of transcribed cards. The indexed facsimile database ensures that all the information in the original archive is included in the database, and the electronic text collection is a huge leap towards a text corpus. The combination seems to be an entirely new way of combining the good and the best. Finally, we should admit that technological progress, in the form of relatively cheap, high-volume, high-speed scanners, came to our rescue as a deus ex machina. But the technique has now been tested on a real lexicographic archive and should be applicable to the similar archives existing at various lexicographic departments.