TEI to Web Conversion
As mentioned earlier, the TEI-encoded texts are used for both the online version and the typeset version. This section discusses the inner workings of the online version.
1. Architecture of the website
To make the search facilities and textual analysis tools available, the TEI XML is transformed to a format that can be used for online publication. The technical infrastructure needed for publication is described next.
The base of the Digital Locke Project website is the XML-aware search engine XPAT (not to be confused with XPath). On top of XPAT runs the DLXS middleware, which can be used to publish many different kinds of extensive texts, images, bibliographic data or archival data. The following picture shows a schematic overview of the architecture of the Digital Locke website:
2. Digital Library eXtension Service (DLXS)
The software used under the hood of DLP is called DLXS, which stands for Digital Library eXtension Service. It has been developed by the Digital Library Production Service of the University of Michigan (see www.dlxs.org). DLXS helps educational and non-profit organizations develop their digital library collections. It consists of a suite of tools for publishing all kinds of material on the web. The DLXS software has four modules, each aimed at a particular type of digital collection:
- Encoded text collections ("Text Class"): see below
- Digital image collections ("Image Class"): specialized in producing image databases for web publication
- Bibliographic data ("Bib Class"): generic module for publishing record-based data on the web
- EAD2002-encoded finding aids ("Findaid Class"): specialized version of Text Class that focuses on the publication of archival material encoded in EAD
3. Text Class
The Locke texts are published using the Text Class functionality of DLXS. Text Class is middleware that is specialized for textual material. Its most important functionality is the search and retrieval of electronic texts. Once texts have been retrieved, Text Class provides templates for generating the HTML pages that display them.
The power of Text Class comes from its predefined output templates for several kinds of publications: serial articles, issues and monographs. For monographs that come with digitized scans of pages, a page turner can easily be added.
Another important aspect (which in fact applies to the complete DLXS software) is the ability to extend the DLXS functionality with your own code, by specializing its Perl routines in an object-oriented way.
4. The search engine (XPAT)
The search engine is called XPAT. It is an improved version of PAT, which was brought on the market by Open Text Corporation. XPAT is currently maintained by the University of Michigan, where the original code was developed. XPAT is an 'XML/SGML-aware' search engine. Even so, it cannot be compared to a (native) XML database, because the heart of XPAT is still an index whose core functionality is based on text. XPAT is fast at searching (large) texts. This speed comes mainly from the fact that it is an index of words in which the specific location (byte offset) of each word in the text is registered. For this reason the indexed XML must not be altered afterwards: inserting even a single byte shifts the offsets, which breaks the complete index and affects the search results.
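The byte-offset idea can be illustrated with a small Python sketch. This is a deliberately simplified word index, not XPAT's actual data structure; the function name `build_word_index` and the sample text are assumptions made for illustration only.

```python
import re
from collections import defaultdict


def build_word_index(data: bytes) -> dict:
    """Map each lowercased word to the byte offsets where it starts.

    Simplification: every run of ASCII letters counts as a word,
    so tag names such as 'P' end up in the index as well.
    """
    index = defaultdict(list)
    for match in re.finditer(rb"[A-Za-z]+", data):
        index[match.group().lower()].append(match.start())
    return dict(index)


# Hypothetical sample text, not taken from the Locke corpus.
text = b"<P>Association of ideas; wrong association misleads.</P>"
index = build_word_index(text)
offsets = index[b"association"]

# Insert one byte at the front of the text without rebuilding the index:
# every stored offset now points one byte too early.
altered = b" " + text
stale = [altered[o:o + len(b"association")].lower() for o in offsets]
```

Looking up a word is a single dictionary access, which is why this kind of index is fast. The `stale` list shows the cost: after the one-byte insertion, none of the stored offsets points at the word any more.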
Apart from searching words in the complete text, XPAT offers
functionality for specific searches in structured data, like XML or
SGML (remember the old days). Using regions, XPAT enables the user
to search in specific predefined XML fragments. The region syntax
allows for queries that specify where the search term must appear.
As an example, a query might state that the search term must appear
within the element <title>, which in turn must appear within the
element <head>. This is the extra functionality that makes XPAT an
XML/SGML-aware search engine.
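One way to picture region searching is to treat every region as a pair of byte offsets and to test containment. The Python sketch below rests on that assumption; the `Region` type, the helper names and all offsets are hypothetical and say nothing about how XPAT stores regions internally.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    name: str
    start: int  # byte offset of the first byte of the region
    end: int    # byte offset just past the last byte of the region


def contains(outer: Region, inner: Region) -> bool:
    """True when the inner region lies completely inside the outer one."""
    return outer.start <= inner.start and inner.end <= outer.end


def hits_within(hits: List[int], inner: List[Region], outer: List[Region]) -> List[int]:
    """Keep hits located in an inner region that itself lies in an outer region."""
    qualifying = [i for i in inner if any(contains(o, i) for o in outer)]
    return [h for h in hits if any(r.start <= h < r.end for r in qualifying)]


# Hypothetical offsets: one <head>, one <title> inside it, one <title> outside it.
heads = [Region("head", 0, 100)]
titles = [Region("title", 10, 40), Region("title", 150, 180)]
hits = [20, 60, 160]

result = hits_within(hits, titles, heads)
```

Only the hit at offset 20 survives: offset 60 lies in the head but not in a title, and offset 160 lies in a title that is not part of a head.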
An XPAT index can be queried from a terminal client on the server. The following example searches for 'association':
    caspar@debian60232:~/share/sandbox/dlxs11/idx/l/locke$ xpatu -D locke.dd
    Digital Library eXtension Service, XPAT, Release 5.3
    COPYRIGHT (c) 2000, 2003, 2004
    The Regents of the University of Michigan
    All Rights Reserved
    >> association;
      1: 7 matches
    >> pr
        883788, ..nuscript">13. Association (1697)</TITLE> <AUTHOR>RTF2T..
        538645, ..uct"-part' on association and the '<HI1 REND="italic">Essay</HI1..
        538530, ..ance of wrong association as a cause of error see Schuurman, pp...
        537712, ..REND="italic">Association of Ideas</HI1> yet haveing donne it t..
        884595, ..nearnocaret"> Association</ADD1> </P> </DI..
        537474, ..LACE="margin">Association</NOTE1><PB N="208"/><MILESTONE N="41" ..
        895151, ..</DEL1><ADD1> associations</ADD1> of them made by custom in the ..
    >>
This example shows the XPAT client being started with
locke.dd as Data Definition file, in which
all references to the index are stored, including a reference to
the XML text. The simplest query is just one search term. After
performing the search XPAT shows the number of occurrences found in
the index. Using the command pr, the results can be printed on the
terminal as shown in the example. Each line printed shows a hit in
the index, starting with the byte offset of where the hit was found
in the XML. The hit itself is printed with a small context. The
printing of the context can be easily manipulated.
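The output format of pr can be imitated in a few lines of Python: take the byte offset of a hit and slice a fixed window of context around it. The function name `format_hits` is hypothetical, and the default window of 14 bytes before and 50 bytes after the hit is an assumption read off the sample output above, not a documented XPAT setting.

```python
from typing import Iterable, List


def format_hits(data: bytes, offsets: Iterable[int],
                before: int = 14, after: int = 50) -> List[str]:
    """Return one line per hit: the byte offset followed by a context window."""
    lines = []
    for offset in offsets:
        start = max(0, offset - before)
        snippet = data[start:offset + after].decode("utf-8", errors="replace")
        # Keep each hit on a single line, as in the terminal output.
        snippet = snippet.replace("\n", " ")
        lines.append(f"{offset}, ..{snippet}..")
    return lines


# Hypothetical fragment modelled on one of the hits shown above.
sample = b'<P><HI1 REND="italic">Association of Ideas</HI1> yet haveing donne it</P>'
offset = sample.index(b"Association")
lines = format_hits(sample, [offset])
```

Changing `before` and `after` widens or narrows the context, which corresponds to the remark that the printing of the context can be manipulated.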
As mentioned earlier, XPAT is an XML/SGML-aware search engine. This means that the text to be indexed is divided into regions. Regions represent specific XML elements occurring in the text. The user can specify regions in which a search term must occur. See for example the following query, which searches for the term 'association' within the XML element HI1:
    >> association within region HI1
      12: one match
    >> pr
        537712, ..REND="italic">Association of Ideas</HI1> yet haveing donne it t..
    >>
Now instead of 7 matches, we get 1 match.
XPAT queries have operators familiar from set theory. As every
XPAT result is returned as a set, a query may contain set operators
that operate on sub-queries. Examples of operators are '^' (logical
AND) and '|' (logical OR). Keywords are used as well, e.g. within
and not. In the following example the keyword including is used:
    >> association within (region DIV1 including (region HI1 including "ideas"));
      16: 5 matches
    >> pr
        538645, ..uct"-part' on association and the '<HI1 REND="italic">Essay</HI1..
        538530, ..ance of wrong association as a cause of error see Schuurman, pp...
        537712, ..REND="italic">Association of Ideas</HI1> yet haveing donne it t..
        537474, ..LACE="margin">Association</NOTE1><PB N="208"/><MILESTONE N="41" ..
        895151, ..</DEL1><ADD1> associations</ADD1> of them made by custom in the ..
    >>
In this query the term association is searched for, but it must be
present within a region 'DIV1', and within that same region DIV1 a
region 'HI1' must be present that contains the word 'ideas'. A user
on the terminal can construct arbitrarily complex queries, but
within the context of DLXS it is the middleware that constructs the
queries for the user (using the search forms) and sends them to
XPAT.
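How middleware might assemble such a query from search-form input can be sketched in Python. This is not the DLXS implementation (which is written in Perl); the helpers `region` and `build_query` are hypothetical, and the only query syntax used is what appears in the terminal examples above.

```python
from typing import Optional


def region(name: str, including: Optional[str] = None) -> str:
    """Return a region expression, optionally restricted by 'including'."""
    if including is None:
        return f"region {name}"
    return f"(region {name} including {including})"


def build_query(term: str, within: Optional[str] = None) -> str:
    """Combine a search term with an optional region restriction."""
    query = term if within is None else f"{term} within {within}"
    return query + ";"


# Values that a search form might supply (hypothetical form fields).
inner = region("HI1", including='"ideas"')
query = build_query("association", within=region("DIV1", including=inner))
simple = build_query("association")
```

With these inputs, `query` reproduces the nested query typed on the terminal above, and `simple` is the plain one-term search. A real middleware layer would also have to validate and escape user input before sending it to the search engine.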