Trip Report: CETH Summer Seminar '95


C. M. Sperberg-McQueen
25 June 1995

The Center for Electronic Texts in the Humanities at Princeton and Rutgers Universities held its fourth summer seminar earlier this month under the title Electronic Texts in the Humanities: Methods and Tools. The organizers paid me the compliment of inviting me to teach one plenary session and several breakout sessions on SGML and the Text Encoding Initiative, so I had the pleasure of attending the entire course. The two weeks of the seminar were too full to allow a comprehensive report of their content to be made, and the intensity of the participants was too great for a written summary to convey the full experience of the seminar. But it does seem worthwhile nevertheless to make at least a brief report on the outstanding impressions of the seminar, while those impressions are fresh.

In previous years, the seminar had been organized as a unified series of lectures and hands-on sessions in the Princeton computer labs. This year all participants attended the same series of plenary lectures and one or two plenary hands-on sessions, but about a quarter of the instructional time was reserved for special-interest sessions which ran in parallel tracks. By means of the shared plenary sessions, participants got a systematic overview of issues relating to electronic texts; in the parallel tracks, it was possible to pursue certain issues in more depth than has been possible at previous CETH summer seminars. Susan Hockey and Willard McCarty, the co-directors of the seminar, taught a track on Textual Analysis. Daniel Greenstein of Glasgow taught a track on Tools for Historical Analysis. Anita Lowry of Iowa led her participants through the technical, policy, personnel, and other issues of Setting Up an Electronic Text Center, while Peter Robinson (Oxford), Geoffrey Rockwell (McMaster) and I taught, respectively, tracks on Scholarly Editing, Hypertext for the Humanities, and the Text Encoding Initiative and SGML. Participants included a range of researchers and members of support staff, employed variously as librarians, faculty members, members of professional research or technical staffs, and graduate students. There was a bias toward literary subjects, but ample representation from linguists and other textual disciplines.

As usual, the participants were for the most part lodged in Princeton dormitories; these are no more Spartan than those at many other campuses, I suppose, and two weeks of communal baths can be a nostalgic experience for many of us, but the absence of air conditioning did seem particularly brutal this year. Some participants, more foresighted or less nostalgic than the rest of us, took rooms at the Nassau Inn for the duration.

In the introductory plenary session, Susan Hockey introduced the notion of electronic texts through a brief survey of text retrieval tools (word lists, concordances, indices, etc.) and the history of computer-assisted literary and linguistic studies (and their immediate forebears, going back to the word-length studies of Mendenhall at the end of the nineteenth century). She also gave a survey of existing archives and inventories. Major problems confronting potential users of electronic text are the fact that many archival sites -- perhaps most -- don't know precisely what they have got, or don't have adequate bibliographic or technical descriptions of it. Until recently, there has been no standard method of documenting the texts; as a result, there is an urgent need to compile a short list of the most urgently needed information. The potential user must also come to grips with the wide variety of encoding schemes in which material of interest may have been encoded, with the general obscurity of the copyright situation, and with the highly variable quality of existing texts.

In the afternoon, Willard McCarty outlined the issues involved in choosing whether to keyboard a text or to scan it, and demonstrated some fairly typical microcomputer software for optical character recognition. The enthusiasm of some participants was noticeably dampened when he walked them through the calculation which reckons up, for a scanner with 98 or 99% accuracy, how many errors will be found on a typical page. (If 1% of the 2000 characters on a typical typed page are in error, there will be twenty errors on that page: not enough to get a passing grade in most first-year typing classes.) Methods for raising the accuracy rate were discussed, the simplest of which seems to be contracting out with a service bureau for a better rate (at, of course, a commensurately higher cost).
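
The arithmetic behind that calculation can be sketched in a few lines. The 2000-character page is taken from the example above; the function name and the third accuracy rate are illustrative assumptions, not anything shown in the session.

```python
def expected_errors_per_page(accuracy, chars_per_page=2000):
    """Expected number of misrecognized characters on one page."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    # Every character has the same chance (1 - accuracy) of being wrong.
    return (1.0 - accuracy) * chars_per_page


# Compare the two rates mentioned in the session with a hypothetical better one.
for accuracy in (0.98, 0.99, 0.999):
    errors = expected_errors_per_page(accuracy)
    print(f"{accuracy:.1%} accurate: about {errors:.0f} errors per page")
```

At 99% accuracy this reproduces the twenty errors per page of the example; at 98% the figure doubles to forty.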

On the second day, Willard McCarty gave an introduction to basic tools, including most prominently the concordance. While a full history of the concordance remains to be written (and deserves to be written), it can nevertheless be traced readily to the Middle Ages, when (as one participant pointed out) it was developed as the monasteries gave way to the universities as the primary cultivators of literacy. This introductory session was followed by a hands-on introduction to TACT, the interactive concordance program developed at the University of Toronto by John Bradley. TACT is unfortunately rather fussy about running on networked machines, so this session proved rather frustrating to some participants, I gather.

In the afternoon of the second day, I gave an introduction to SGML and the Text Encoding Initiative. After describing the goals and syntax of SGML, I took the participants through the steps of document analysis and we performed a superficial but diverting analysis of a fragment of the "Rime of the Ancient Mariner" (which won out by a narrow margin over a fragment of Gibbon's DECLINE AND FALL OF THE ROMAN EMPIRE). The afternoon was concluded with a summary of background information on the TEI and a brief overview of its contents. After supper, a group of diehards reconvened in the basement computer lab at the Princeton Computer Center to tag the "Rime of the Ancient Mariner" using the TEI's SGML document type definition. This was the first of a whole series of improvised evening sessions added to the program ad hoc to address specific points of interest.

On the third day, Bob Hollander of Princeton discussed the Dartmouth Dante Project, in which he has managed to put into electronic form a selection of running commentaries on Dante's Comedy ranging from the fourteenth to the twentieth centuries, with software to enable them to be searched, and for commentaries on the same passage to be compared. Greg Murphy, who manages text systems for CETH, also gave an introduction to the ARTFL database at the University of Chicago. The afternoon was devoted to the parallel tracks.

Thursday morning of the first week saw an introduction to issues of scholarly editing from Peter Robinson (the author of a prominent collation program), followed by a panel discussion in which a number of practicing editors participated. David Chesnutt, the editor of the Papers of Henry Laurens; Hoyt Duggan, who is creating a massive parallel electronic edition of the manuscripts (and eventually of the archetypes) of the various texts of PIERS PLOWMAN; and Richard Finneran, the editor of a hypertext edition of Yeats, were joined by Douglas Kincade of Princeton University Press in a discussion of practical and theoretical issues in electronic editions and their relation to paper editions.

The first week was concluded by an exceptionally lively presentation from Dan Greenstein on the topic of structured databases, and the troubles of historians who use databases to summarize their data but wish to keep the data tied closely to the textual witnesses from which the summaries are derived. He discussed the difficulties in yoking normalized relational databases to text at such length, and with such vivid examples, that it was rather a relief when at length he began to discuss methods of bringing text and database together in a useful way. This is territory first mapped out, I believe, by Manfred Thaller's programs CLIO and (later) KLEIO, but Greenstein discussed at somewhat more length the advantages of using the feature structure notation defined in the TEI Guidelines. This surprised no one, since Greenstein was one of the principal actors in demonstrating the applicability of feature structures to areas outside linguistics.

At the weekend, some participants went home, others to New York, and still others spent much of a beautiful Saturday in the computer lab. On Sunday about half the participants in the seminar went into northern New Jersey for a hike near the Delaware Water Gap. Alumni of previous years will remember this hike with some fondness, and will grieve, no doubt, to learn that there were too many of us this year to fit into the diner at the bottom of the hill, so that this year's participants had to return home without initiation into the mysteries of scrapple.

The second week began with an introduction to hypertext from Geoff Rockwell, who was teaching the hypertext track. He demonstrated freestanding commercial hypertexts, the use of Hypercard to create one's own hypertext, and of course the use of the World Wide Web and HTML for hypertext delivery. The high point, by common consensus, was the Macbeth segment, especially the demonstration of the karaoke Macbeth with Rockwell as Macbeth. Also on Monday morning, Greg Crane of Tufts University described the Perseus Project, a large collection of materials for the study of classical civilization currently delivered with hypertext functionality provided in Hypercard. The Perseus materials, however, Crane was careful to note, are encoded in non-proprietary standard forms: the texts in SGML, the images in standard graphics formats at significantly higher resolutions than are currently deliverable on the desktop. As a result, Perseus will be able to survive and be delivered in other forms when Hypercard has gone the way of all software and disappeared into obsolescence. He capped his talk with a short presentation of the electronic version of the Greek lexicon of Liddell, Scott, and Jones, which has recently been digitized thanks to a grant from the National Endowment for the Humanities. He had expected a great deal of arduous work to be necessary, he said, before the material could be retagged enough to be useful, but had recently discovered that even with relatively rudimentary SGML tagging (identifying little more than entries, definitions, and citations to classical authors) the lexicon can be usefully consulted. He demonstrated how the lexicon could be linked to the morphological analyzer developed for Perseus so that a student can read a text on line, click on a form, and be sent to the appropriate entry in Liddell and Scott. (This does not work, of course, for absolutely every form: the analyzer is stumped by some forms.) The citations in the lexicon can also be analyzed, with large though not absolutely complete success, so that the student can see when a particular word in Thucydides, for example, is discussed in the lexicon. (Since the lexicon focuses, as does Perseus, on the most commonly studied texts, about one word in fifteen in Perseus texts is cited specifically in the lexicon.) As the work of enhancing the markup of the dictionary progresses, even more sophisticated tools and searches will be possible. But even the simple expedient of inserting a blank line between senses can render a complex article more easily read in electronic form than in paper.

Tuesday, Anita Lowry discussed issues of institutional support for electronic texts, drawing on her own extensive experience at Columbia and Iowa and on the experience of colleagues elsewhere. A guest lecture from Richard Gartner of the Bodleian Library gave insight into how that institution is responding, in its own complex ways, corresponding to its own labyrinthine organization, to the advent of electronic texts. Paul Evan Peters of the Coalition for Networked Information also appeared for a guest lecture, in which he addressed policy issues at a national and international level. I never expect discussions of national or international information policy to be coherent, let alone interesting, but Paul Peters is the kind of speaker who could give policy analysis a good reputation.

Wednesday, the omnipresent Peter Robinson reappeared, this time to discuss digital imaging techniques, which he has been dealing with in connection with his project to create CD-ROM editions of all the manuscripts of the Canterbury Tales, tale by tale. Kirk Alexander of Princeton's Interactive Computer Graphics Lab gave a presentation on the Piero Project, which uses CAD software to allow the user to study in great detail the program of a fresco cycle by Piero della Francesca.

On Thursday, Greg Murphy, Peter Robinson, and I gave a brief comparative demonstration of three methods of delivering SGML-encoded electronic text to potential readers. (By this time, the seminar had taken on the character of a TEI tent meeting, and no one in the seminar was admitting to any interest in any text NOT encoded in SGML, and preferably in TEI.) First, we showed a sample text written in TEI form but translated automatically into HTML, to illustrate the critical point that delivery in HTML does not require that the document be maintained in HTML; then we showed the same text as it might be delivered over the network in SGML and displayed in Panorama, the SGML viewer distributed by SoftQuad in both free and commercial forms. Since the sample text was not particularly complex or demanding typographically or hypertextually, Greg Murphy then showed some more complex materials encoded in SGML, which can be found on the CETH home page at http://www.princeton.edu/~gjmurphy/sgml/. These include a selection of Aesop's FABLES with two different Panorama navigators, and two articles from the psychoanalytic literature on aphasia, one by Freud and one by a later commentator. Finally, Peter Robinson showed the original text, compiled as a DynaText book.
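
The point of the first demonstration, that a text maintained in TEI form can be turned into HTML mechanically, can be illustrated with a minimal sketch. It is not the translator we used: it assumes a well-formed XML rendering of a TEI-like fragment (rather than full SGML), the mapping from element names to HTML tags is a hypothetical choice made for illustration, and the sample text is invented for the purpose.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a few TEI-style element names to HTML tags.
TAG_MAP = {"text": "div", "head": "h2", "lg": "div", "l": "p", "hi": "em"}


def tei_to_html(source):
    """Parse a TEI-like XML fragment and return an HTML fragment string."""
    root = ET.fromstring(source)
    for element in root.iter():
        # Record the original element name as a class; the original
        # attributes are simply dropped in this sketch.
        tei_name = element.tag
        element.tag = TAG_MAP.get(tei_name, "span")
        element.attrib.clear()
        element.set("class", tei_name)
    return ET.tostring(root, encoding="unicode", method="html")


# Illustrative sample only; not the text shown at the seminar.
SAMPLE = (
    '<text><head>The Rime of the Ancient Mariner</head>'
    '<lg type="stanza"><l>It is an ancient Mariner,</l>'
    '<l>And he stoppeth one of three.</l></lg></text>'
)

print(tei_to_html(SAMPLE))
```

The master copy stays in the richer encoding; the HTML is a disposable derivative that can be regenerated whenever the mapping or the delivery format changes.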

In the second half of the morning, George Miller of Princeton spoke about WordNet, a large semantic network of modern English he has been working on for some time, and demonstrated some of the many varieties of software which can exploit the information it contains.

On the Thursday afternoon and Friday morning of the second week, the sessions were given over to presentations by participants in the seminar, describing the long-term goals of the projects they were working on, and the progress made on them during the course of the seminar. These included several WWW pages constructed to address institutional or disciplinary needs, a number of projects in linguistic analysis (of endangered languages in Northwest China, of English, of Estonian, and of Korean), and a variety of editions (of the glossa ordinaria, of a twentieth-century writer influential in Futurist circles, of revolutionary-era American state constitutions, of Mark Twain letters, and of a poem by Pushkin, among others), as well as essays in literary or stylistic analysis, materials for language instruction, and yet more. As in previous years, the participants' presentations were a highlight of the entire seminar, and strengthened me in my belief that with such vigorous and interesting work going on, humanities computing is in very healthy shape.

Congratulations are due to the sponsors (CETH, together with the Centre for Computing in the Humanities at Toronto) and the organizers (an untiring staff at CETH, with assistance from Princeton's Computing and Information Technology group), who provide, in this annual seminar, a signal service to all of us interested in the application of computers to humanistic studies.

C. M. Sperberg-McQueen

Last Updated: February 13, 1996