TEI to Web Conversion

As mentioned earlier, the TEI encoded texts are used for both the online version and the typeset version. Here the inner workings of the online version are discussed.

Contents

1. Architecture of the website

2. Digital Library eXtension Service (DLXS)

3. Text Class

4. The search engine (XPAT)

1. Architecture of the website

To make available the search facilities and textual analysis tools, the TEI XML is transformed to a format that can be used for online publication. The technical infrastructure that is needed for publication is described next.

The Digital Locke Project website is built on the XML-aware search engine XPAT (not to be confused with XPath). The DLXS middleware runs on top of XPAT and can be used to publish many kinds of material: extensive texts, images, bibliographic data or archival data. The following picture shows a schematic overview of the architecture of the Digital Locke website:

Schematic overview of web publication

2. Digital Library eXtension Service (DLXS)

The software that runs under the hood of the Digital Locke Project (DLP) is called DLXS, which stands for Digital Library eXtension Service. This software has been developed by the Digital Library Production Service of the University of Michigan (see www.dlxs.org). DLXS helps educational and non-profit organizations develop their digital library collections. It consists of a suite of tools for publishing all kinds of material on the web. The DLXS software has four special modules, called classes, each for presenting a specific type of digital collection: Text Class, Image Class, Bibliographic Class and Finding Aid Class.

3. Text Class

The Locke texts are published using the Text Class functionality of DLXS. Text Class is middleware that is specialized for textual material. Its most important function is the search and retrieval of electronic texts. Once texts have been retrieved, Text Class provides templates for generating the HTML pages that display them.

The power of Text Class comes from its predefined output templates for several types of publication: serial articles, issues and monographs. For monographs that come with digitized page scans, a page turner is readily available.

Another important aspect (which in fact applies to the complete DLXS software) is that the DLXS functionality can be extended with your own code, by specializing the Perl routines in an object-oriented way.

4. The search engine (XPAT)

The search engine is called XPAT. It is an improved version of PAT, which was marketed by Open Text Corporation. XPAT is currently maintained by the University of Michigan, where the original code was developed. XPAT is an 'XML/SGML-aware' search engine. It cannot be compared to a (native) XML database, however, because at its heart XPAT is still a text index. XPAT is fast at searching (large) texts. Its speed can be attributed mainly to the fact that it is an index of words, in which the specific location (byte offset) of each word in the text is registered. For this reason the indexed XML must not be altered afterwards: doing so invalidates the complete index, and even the insertion of a single byte affects the search results.
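The idea of a byte-offset index, and why it breaks when the indexed file changes, can be illustrated with a small Python sketch. This is not XPAT's actual data structure; the function name build_offset_index and the sample text are hypothetical, and the sketch simply indexes every run of ASCII letters (including tag names).

```python
import re
from collections import defaultdict


def build_offset_index(data: bytes) -> dict[str, list[int]]:
    """Map each lowercased word to the byte offsets at which it starts."""
    index = defaultdict(list)
    for match in re.finditer(rb"[A-Za-z]+", data):
        word = match.group().lower().decode("ascii")
        index[word].append(match.start())
    return dict(index)


# Hypothetical sample text, loosely modelled on the Locke transcriptions.
text = b"<P>Association of Ideas</P><P>wrong association</P>"
index = build_offset_index(text)
print(index["association"])

# Inserting a single byte at the front shifts every later offset,
# so the stored offsets no longer point at the indexed words.
altered = b" " + text
stale = index["association"][0]
print(text[stale:stale + 11])     # the word, read from the original text
print(altered[stale:stale + 11])  # something else, read from the altered text
```

A lookup is a single dictionary access followed by direct reads at the stored offsets, which is why this kind of index is fast and also why it has to be rebuilt whenever the text changes.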

Apart from searching for words in the complete text, XPAT offers functionality for specific searches in structured data, like XML or SGML (remember the old days). Using regions, XPAT enables the user to search within specific predefined XML fragments. The region syntax allows for queries that specify where the search term must appear. As an example, a query might state that the search term must appear within the element <title>, which in turn must appear within the element <head>. This is the extra functionality that makes XPAT an XML/SGML-aware search engine.

An XPAT index can be queried in a terminal client on the server. The following example shows a search for 'association':


caspar@debian60232:~/share/sandbox/dlxs11/idx/l/locke$ xpatu -D locke.dd 

Digital Library eXtension Service, XPAT, Release 5.3 
COPYRIGHT (c) 2000, 2003, 2004 The Regents of the University of Michigan 
All Rights Reserved 

>> association; 
1: 7 matches 

>> pr 
883788, ..nuscript">13. Association (1697)</TITLE>           <AUTHOR>RTF2T.. 
538645, ..uct"-part' on association and the '<HI1 REND="italic">Essay</HI1.. 
538530, ..ance of wrong association as a cause of error see Schuurman, pp... 
537712, ..REND="italic">Association of Ideas</HI1> yet haveing donne it  t.. 
884595, ..nearnocaret"> Association</ADD1>             </P>           </DI.. 
537474, ..LACE="margin">Association</NOTE1><PB N="208"/><MILESTONE N="41" .. 
895151, ..</DEL1><ADD1> associations</ADD1> of them made by custom in the .. 

>> 

This example shows the XPAT client being started with locke.dd as Data Definition file, in which all references to the index are stored, including a reference to the XML text. The simplest query is just one search term. After performing the search XPAT shows the number of occurrences found in the index. Using the command pr, the results can be printed on the terminal as shown in the example. Each line printed shows a hit in the index, starting with the byte offset of where the hit was found in the XML. The hit itself is printed with a small context. The printing of the context can be easily manipulated.
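If the printed hits need further processing, each line can be split into its byte offset and its context. The following Python sketch assumes the line layout seen in the listing above (offset, comma, then the context framed by ".." markers); that layout is inferred from this one example and not from the XPAT documentation, and parse_pr_line is a hypothetical helper name.

```python
def parse_pr_line(line: str) -> tuple[int, str]:
    """Split one printed hit into (byte offset, context snippet)."""
    # Only the first ", " separates the offset from the context;
    # later commas belong to the context itself.
    offset, _, context = line.partition(", ")
    # Drop the ".." markers that frame the context in the listing above.
    context = context.strip().removeprefix("..").removesuffix("..")
    return int(offset), context


sample = '537712, ..REND="italic">Association of Ideas</HI1> yet haveing donne it  t.. '
offset, context = parse_pr_line(sample)
print(offset)
print(context)
```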

As mentioned earlier, XPAT is an XML/SGML-aware search engine. This means that the text to be indexed is divided into regions. Regions represent specific XML elements occurring in the text. The user can specify regions in which a search term must occur. See for example the following query, which searches for the term 'association' within the XML element HI1:


>> association within region HI1 
12: one match 

>> pr 
537712, ..REND="italic">Association of Ideas</HI1> yet haveing donne it  t.. 

>> 

Now instead of 7 matches, we get 1 match.
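Conceptually, a region can be thought of as a span of byte offsets, and a 'within' query as a filter that keeps only the hits falling inside such a span. The Python sketch below illustrates that idea on a hypothetical sample text; it is not how XPAT is implemented, the helper names find_regions and within are invented for the illustration, and the regular expression only handles non-nested elements.

```python
import re


def find_regions(data: bytes, tag: str) -> list[tuple[int, int]]:
    """Return (start, end) byte spans of non-nested <tag>...</tag> elements."""
    name = re.escape(tag.encode("ascii"))
    pattern = re.compile(rb"<" + name + rb"\b[^>]*>.*?</" + name + rb">", re.DOTALL)
    return [m.span() for m in pattern.finditer(data)]


def within(hits: list[int], regions: list[tuple[int, int]]) -> list[int]:
    """Keep only the hit offsets that fall inside one of the regions."""
    return [h for h in hits if any(start <= h < end for start, end in regions)]


# Hypothetical sample text, loosely modelled on the listing above.
text = (b'<P><HI1 REND="italic">Association of Ideas</HI1> yet '
        b'<ADD1> associations</ADD1> of them</P>')
hits = [m.start() for m in re.finditer(rb"association", text, re.IGNORECASE)]
hi1_regions = find_regions(text, "HI1")

print(len(hits))                       # hits anywhere in the text
print(len(within(hits, hi1_regions)))  # hits inside an HI1 element
```

As in the terminal session, restricting the search to a region narrows the result set: only the hit inside the HI1 element survives the filter.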

XPAT queries use operators familiar from set theory. As every XPAT result is returned as a set, a query may contain set operators that operate on sub-queries. Examples of operators are '^' (logical AND) and '|' (logical OR). Keywords are used as well, e.g. including, within, and not. In the following example the keyword including is used:


>> association within (region DIV1 including (region HI1 including "ideas")); 
16: 5 matches 

>> pr 
538645, ..uct"-part' on association and the '<HI1 REND="italic">Essay</HI1.. 
538530, ..ance of wrong association as a cause of error see Schuurman, pp... 
537712, ..REND="italic">Association of Ideas</HI1> yet haveing donne it  t.. 
537474, ..LACE="margin">Association</NOTE1><PB N="208"/><MILESTONE N="41" .. 
895151, ..</DEL1><ADD1> associations</ADD1> of them made by custom in the .. 

>>

This query searches for the term 'association', but only within a region 'DIV1' that also contains a region 'HI1' in which the word 'ideas' occurs. A user on the terminal can obviously construct arbitrarily complex queries, but within the context of DLXS it is the middleware that constructs the queries for the user (from the search forms) and sends them to XPAT.
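How middleware might assemble such a query string from form input can be sketched as follows. This is only an illustration in Python: the DLXS middleware itself is written in Perl, and the function names region_including and build_query are hypothetical and not part of DLXS. The sketch uses only the query syntax shown in the examples above.

```python
from typing import Optional


def region_including(region: str, inner: str) -> str:
    """Build a sub-query for a region that must contain `inner`."""
    return f"(region {region} including {inner})"


def build_query(term: str, container: Optional[str] = None) -> str:
    """Build a query for `term`, optionally restricted to `container`."""
    if container is None:
        return f"{term};"
    return f"{term} within {container};"


# Reconstruct the nested query from the terminal example above.
inner = region_including("HI1", '"ideas"')
query = build_query("association", region_including("DIV1", inner))
print(query)
```

Composing the query from small helper functions keeps the nesting of parentheses correct regardless of how many regions a search form combines.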