How TC Works

In the "Getting Started" entry, you were introduced to this XML, in the file 'Bodley.xml':

<?xml version="1.0" ?> 
<!DOCTYPE TEI SYSTEM "../common/chaucerTC.dtd">
<TEI xmlns=""  xmlns:det="">
 			<publicationStmt><p>Draft for Textual Communities site</p></publicationStmt>
 			<sourceDesc><bibl det:document="Bodley"></bibl></sourceDesc>
   			<refsDecl det:documentRefsDecl="Manuscript"  det:entityRefsDecl="Simple Poetry">
 	 			<p>Textual Communities declarations</p>
		<pb n="110v" facs="BD110V.JPG"/>
			<lb/><div n="Book of the Duchess">
				<head n="Title">The Boke of the Duchesse</head>
				<lb/><l n="1">I haue grete wondir be this light</l>
                                more lines...

There are a few key things you should note about this file:

  • <bibl det:document="Bodley"></bibl>. This gives the name which this document will have.  If you change this name, then save and reload the file, you will have a new document with the new name. The 'det:document' attribute is in the 'det' (documents, entities and texts) namespace, formally defined by the 'xmlns:det=""' declaration on the <TEI> element.
  • <refsDecl det:documentRefsDecl="Manuscript" det:entityRefsDecl="Simple Poetry">: this links the document to two reference declarations: it is a 'Manuscript' document, and will contain 'Simple Poetry' entities. See the next sections.

In the sample file above, the key line is <pb n="110v" facs="BD110V.JPG"/>.  This is how you indicate the separate pages or folios of the document.  Each <pb> element has two attributes:

  •  n: use this to give the name or number you wish this page to have.  Possible values include: '1', '2', 'Front cover', 'Inside fly leaf', '1r', '1v'
  •  facs: use this to point at an image of the page.  This should be the name of a file you upload to the community, within a zip folder, as explained in the Getting Started documentation.

You can also use a third attribute, rend.  You can use this to instruct the system to extract part of the image only.  This is very useful when you have (say) images of whole openings but you want to look at each page, the right or left half of the opening, on its own.  Say you have an image of an opening, 1-2.jpg, containing folio 1v as the left half and 2r as the right half.  This is how you would encode it:

        <pb n="1v" facs="1-2.jpg" rend="0,0,50,100" />
        <pb n="2r" facs="1-2.jpg" rend="50,0,100,100" />
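
As a rough illustration of what these two rend values describe, here is a minimal Python sketch. It assumes that the four numbers are left, top, right and bottom, each given as a percentage of the full image; that reading fits the two examples above but is an assumption, and the function name and image dimensions are invented for the illustration.

```python
def rend_to_pixel_box(rend: str, width: int, height: int) -> tuple:
    """Convert a rend value such as "50,0,100,100" into a pixel box.

    Assumption: the numbers are left, top, right, bottom, each a
    percentage of the full image size.
    """
    left, top, right, bottom = (float(part) for part in rend.split(","))
    return (
        round(width * left / 100),
        round(height * top / 100),
        round(width * right / 100),
        round(height * bottom / 100),
    )


# An opening photographed at 4000 x 3000 pixels (made-up dimensions).
print(rend_to_pixel_box("0,0,50,100", 4000, 3000))    # left half: folio 1v
print(rend_to_pixel_box("50,0,100,100", 4000, 3000))  # right half: folio 2r
```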

You need one <pb> element for each page of the manuscript: so a 280 page manuscript will need 280 <pb> elements. There is no theoretical limit to the number of <pb> elements you can include.  Note that you can write either <pb n="2" facs="002.jpg"></pb> or <pb n="2" facs="002.jpg"/>.

RefsDecls (reference declarations): the magic which makes it all work -- document structure

The last section shows how you put in documents, refer to the pages of the document, and link the pages to images of them.  But as we saw from the Getting Started documentation, Textual Communities does far more than this.  It understands that a document is not just made up of pages, but those pages are composed of lines, or of columns themselves composed of lines.  It also understands, crucially, that documents contain what we call 'entities' -- acts of communication, such as an instance of Dante's Commedia, or of the Gospel of John -- and it understands that those entities may be made up of further entities contained within them (as canticles containing cantos containing verses).  Then, most importantly, it knows how to link documents and entities, at every level, so that it knows exactly which part of the first canto of Dante's Inferno is contained in which line of which page of which document.

Textual Communities knows how to do these things through "RefsDecls": reference declarations.  This is a mechanism invented by the Text Encoding Initiative to relate encoding within TEI documents to their representation in other systems: for example, associating the statement "Book 1" with <div n="1" type="book">.  We have taken this over, and extended the RefsDecls mechanism into a complete way of analyzing documents and the entities they contain. In essence: TC uses refsDecl declarations to cut each text into pieces, with each piece belonging both to a "document" and an "entity".  All those pieces are then labelled and stored within a database, ready for extraction by the API: page by page, entity by entity.

Textual Communities is built on two pillars.  The first pillar is: 'Documents' -- manuscripts, printed books, anything containing text. We need to be able to identify documents, and then we need to be able to navigate them, page by page, column by column, line by line.  For this we need a referencing scheme: a way of saying, unambiguously, that I am referring to this page, this column, this line of this document.  Through this step, you tell Textual Communities how your documents are to be referenced.  To do this, choose 'Doc RefsDecl' from the 'RefsDecl' drop-down menu in the admin panel in your community:

This will take you to a document referencing system screen:

Here, you see the two default refsDecls provided by Textual Communities, for "Manuscript" and "Print".  Recall the declaration "<refsDecl det:documentRefsDecl="Manuscript" det:entityRefsDecl="Simple Poetry">" in the sample document given at the beginning of this entry.  This tells Textual Communities to use the default refsDecl "Manuscript" to process this file when it is uploaded.

You can gain some sense of how a refsDecl works by choosing the 'Manuscript' declaration in this drop down menu, and studying what appears in the 'XML' box below:

    <cRefPattern matchPattern="urn:det:TCUSask:{{ community_identifier }}:document={{ document_identifier }}:Folio=(.+)" replacementPattern="#xpath(//pb[@n='$1'])">
    <p>This pointer pattern extracts and references a pb element for each folio.</p>

This fragment does two things:

  1.  It declares how each page in each document processed using this refsDecl will be named. Each page is given the urn "urn:det:TCUSask:{{ community_identifier }}:document={{ document_identifier }}:Folio=(.+)", with the names of your community and document substituted for {{ community_identifier }} and {{ document_identifier }}. It also says that each page should be called a 'Folio' (not 'folio' or 'page'), and takes the value of the Folio from the "n" attribute on each <pb> element.  Thus, in community BD37, document Bodley, the <pb n="110v" facs="BD110V.JPG"/> element is assigned the identifier "urn:det:TCUSask:BD37:document=Bodley:Folio=110v".
  2. It tells the database to associate all the text found in this page with this identifier. When each document is loaded, the TC system reads the refsDecls for that document and then runs the XPath expression found in each cRefPattern (here, //pb[@n='$1']) across the document, extracts all the text associated with that expression (here, the text of each page), and links it in the database to this page. In this case, it matches the element <pb n="110v" facs="BD110V.JPG"/> and assigns all the text following it, up to the next <pb> element (or the end of the document), to that element.
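
The first of these two steps, turning an identifier into an XPath expression, can be sketched in a few lines of Python. This is only an illustration of the general cRefPattern idea (match the urn against the pattern, then substitute the captured value for $1); it is not the TC implementation, and the function name resolve is invented for the example.

```python
import re

MATCH_PATTERN = (
    "urn:det:TCUSask:{{ community_identifier }}"
    ":document={{ document_identifier }}:Folio=(.+)"
)
REPLACEMENT_PATTERN = "#xpath(//pb[@n='$1'])"


def resolve(urn, community, document):
    """Return the XPath pointer for a urn, or None if it does not match."""
    # Fill in the two template placeholders, escaped for use in a regex.
    pattern = MATCH_PATTERN.replace(
        "{{ community_identifier }}", re.escape(community)
    ).replace("{{ document_identifier }}", re.escape(document))
    match = re.fullmatch(pattern, urn)
    if match is None:
        return None
    # Replace $1, $2, ... with the corresponding captured groups.
    return re.sub(
        r"\$(\d+)",
        lambda ref: match.group(int(ref.group(1))),
        REPLACEMENT_PATTERN,
    )


print(resolve("urn:det:TCUSask:BD37:document=Bodley:Folio=110v", "BD37", "Bodley"))
# prints: #xpath(//pb[@n='110v'])
```

Because (.+) is greedy, a longer urn ending in ":Line=2" would also match this Folio pattern, so in a sketch like this the order in which patterns are tried matters: the most specific pattern has to come first.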

The next fragment of our Manuscript refsDecl identifies each line within a page:

  <cRefPattern matchPattern="urn:det:TCUSask:{{ community_identifier }}:document={{ document_identifier }}:Folio=(.+):Line=(.+)" replacementPattern="#xpath(//pb[@n='$1']/following::lb[@n='$2'])">     
   <p>This pointer pattern extracts and references a lb element within a pb element for each folio.</p>   

Here, the XPath expression //pb[@n='$1']/following::lb[@n='$2'] finds each <lb/> element following each <pb/> element: this is how it knows, for example, that text is within the sixth line of folio 130r of the Bodley manuscript.
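
To make the page and line segmentation concrete, here is a small Python sketch that walks a cut-down version of the sample file and collects the text belonging to each line of each folio. It is not the TC code. It assumes that a line's number is simply its position among the <lb/> elements after the preceding <pb/> (the sample <lb/> elements carry no "n" attribute), and the <body> wrapper, closing tags and function name are added only to make the example self-contained.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<body>
<pb n="110v" facs="BD110V.JPG"/>
<lb/><div n="Book of the Duchess">
<head n="Title">The Boke of the Duchesse</head>
<lb/><l n="1">I haue grete wondir be this light</l>
</div>
</body>"""


def text_by_line(root):
    """Map (folio, line number) to the text found on that line."""
    lines = {}
    folio, line = None, 0

    def add(text):
        # Ignore whitespace-only text and anything before the first <pb/>.
        if text and text.strip() and folio is not None:
            key = (folio, line)
            lines[key] = (lines.get(key, "") + " " + text.strip()).strip()

    def walk(element):
        nonlocal folio, line
        if element.tag == "pb":
            folio, line = element.get("n"), 0  # new page: restart the count
        elif element.tag == "lb":
            line += 1                          # assumption: lines are counted
        add(element.text)
        for child in element:
            walk(child)
            add(child.tail)

    walk(root)
    return lines


for (folio, line), text in text_by_line(ET.fromstring(SAMPLE)).items():
    print(f"Folio={folio}:Line={line} -> {text}")
```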

RefsDecls (reference declarations): the magic which makes it all work -- textual entity structure

Now, we go to the second crucial step, relating to the second pillar of Textual Communities: textual entities.  What you are going to transcribe and edit in the documents is text: and not just 'text', in the sense of sequences of letters, but text which is structured, as every act of communication is structured.  Text comes in paragraphs, in sentences, in chapters; in poems, in stanzas, in lines. In Textual Communities we call this structured text 'textual entities', or simply 'entities'.  Just as we need a referencing system for documents (by page or folio, then column and line), we need a referencing system for entities: by chapter, paragraph or sentence; by poem, stanza and line. This step allows you to specify how you reference the textual entities contained in the documents.

By default, Textual Communities makes available four different text structures:

  1.  'Simple Poetry': if your texts are poems, with each poem made up of a series of lines
  2.  'Simple prose': if your text is prose, with each item made up of a series of paragraphs
  3.  'Complex prose': if your prose texts are grouped into collections of items, with each item composed of paragraphs
  4.  'Complex poetry': if your poems are composed of stanzas, each containing a series of lines.

In our sample Bodley.xml file, given above, the encoding <refsDecl det:documentRefsDecl="Manuscript" det:entityRefsDecl="Simple Poetry"> instructs TC to use the refsDecl "Simple Poetry" to process the file.  You can see these declarations by going to the RefsDecl menu on your admin panel, selecting "Entity RefsDecl", and then choosing one of the four choices from the "BaseRefsDecl" menu.

Here is how the "Simple Poetry" refsDecl looks:

  <cRefPattern matchPattern="urn:det:TCUSask:{{ community_identifier }}:entity=(.+)" replacementPattern="#xpath(//body/div[@n='$1'])">
     <p>This pointer pattern extracts and references each top-level unit of text, as  a top-level div,  as an entity</p>   

Once more, an XPath expression is used to point at a segment of XML: in this case, the top-level div below the containing <body> element. Thus, this expression will match the element <div n="Book of the Duchess"> in the Bodley.xml file, as follows:

  <pb n="110v" facs="BD110V.JPG"/>
  <lb/><div n="Book of the Duchess">

It will assign this segment of text the name "urn:det:TCUSask:BD37:entity=Book of the Duchess".  Furthermore, TC cross-references the text within this entity to the document structure.  Thus it knows that the first verse of the poem "I haue grete wondir be this light" is identified, in terms of the document structure, as the text in "urn:det:TCUSask:BD37:document=Bodley:Folio=110v:Line=2" and as the text in "urn:det:TCUSask:BD37:entity=Book of the Duchess:Verse=1".  TC melds these two statements into a single identifier, as follows:

     urn:det:TCUSask:BD37:document=Bodley:Folio=110v:Line=2:entity=Book of the Duchess:Verse=1

That is: here is the text of the first verse of the Book of the Duchess, as it appears in the second line of folio 110v of the Bodley manuscript.
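
The melding of a document identifier and an entity identifier can be pictured with a short Python sketch. It assumes only what the identifiers above show: both share the same community prefix, and the document part comes before the entity part. The function name meld is invented for the illustration and is not part of TC.

```python
def meld(document_urn, entity_urn):
    """Join a document urn and an entity urn that share a community prefix."""
    doc_prefix, doc_marker, doc_path = document_urn.partition(":document=")
    ent_prefix, ent_marker, ent_path = entity_urn.partition(":entity=")
    if not doc_marker or not ent_marker:
        raise ValueError("expected one document urn and one entity urn")
    if doc_prefix != ent_prefix:
        raise ValueError("the two urns belong to different communities")
    return f"{doc_prefix}:document={doc_path}:entity={ent_path}"


print(meld(
    "urn:det:TCUSask:BD37:document=Bodley:Folio=110v:Line=2",
    "urn:det:TCUSask:BD37:entity=Book of the Duchess:Verse=1",
))
```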

Textual Communities and overlapping hierarchies

We have mentioned several times that Textual Communities resolves one of the long-standing difficulties of the use of XML and related systems for encoding scholarly texts: the problem of encoding texts which have both a document hierarchy (pages, columns, lines) and what we call an entity hierarchy (book, chapter, verse). Here is how TC resolves that problem.  Suppose that our first verse of the Book of the Duchess,  "I haue grete wondir be this light", was written across two lines of the manuscript, as follows:

   <lb/><div n="Book of the Duchess">
        <head n="Title">The Boke of the Duchesse</head>
        <lb/><l n="1">I haue grete wondir
        <lb/> be this light </l>

In this case, TC would see the first verse line of the Book of the Duchess as being made up of two textual fragments, each with a separate entry in the database and a separate identifier, thus:

  •  urn:det:TCUSask:BD37:document=Bodley:Folio=110v:Line=2:entity=Book of the Duchess:Verse=1: the text "I haue grete wondir" appearing in the second line of this folio, and
  •  urn:det:TCUSask:BD37:document=Bodley:Folio=110v:Line=3:entity=Book of the Duchess:Verse=1: the text "be this light" appearing in the third line of this folio

Further, TC understands that the two text fragments are linked, and uses the TEI linking attributes "prev" and "next" to indicate that the text "be this light", in the third line, is a continuation of the text "I haue grete wondir", in the second line. Note that this system copes with every kind of overlap between document and entity: across lines, columns, pages; books, chapters, verses, paragraphs, and more.
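
The following Python sketch shows one way to picture this splitting. It is not the TC implementation: it assumes that lines are numbered by counting <lb/> elements after each <pb/>, it looks only at text inside <l> elements (so the title is skipped), and the <body> wrapper, the PREFIX constant and the function name are invented to make the example self-contained.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<body>
<pb n="110v" facs="BD110V.JPG"/>
<lb/><div n="Book of the Duchess">
<head n="Title">The Boke of the Duchesse</head>
<lb/><l n="1">I haue grete wondir
<lb/> be this light </l>
</div>
</body>"""

PREFIX = "urn:det:TCUSask:BD37:document=Bodley"  # community and document


def fragments(root):
    """Split verse text into pieces keyed by folio, line, poem and verse."""
    found = []
    state = {"folio": None, "line": 0, "poem": None, "verse": None}

    def add(text):
        # Keep only non-empty text that sits inside a verse line.
        if text and text.strip() and state["verse"] is not None:
            found.append({
                "urn": f"{PREFIX}:Folio={state['folio']}:Line={state['line']}"
                       f":entity={state['poem']}:Verse={state['verse']}",
                "text": text.strip(), "prev": None, "next": None,
            })

    def walk(element):
        if element.tag == "pb":
            state["folio"], state["line"] = element.get("n"), 0
        elif element.tag == "lb":
            state["line"] += 1
        elif element.tag == "div":
            state["poem"] = element.get("n")
        elif element.tag == "l":
            state["verse"] = element.get("n")
        add(element.text)
        for child in element:
            walk(child)
            add(child.tail)
        if element.tag == "l":
            state["verse"] = None  # the verse ends with its closing tag

    walk(root)
    # Link consecutive fragments that belong to the same verse.
    for earlier, later in zip(found, found[1:]):
        if earlier["urn"].split(":entity=")[1] == later["urn"].split(":entity=")[1]:
            earlier["next"], later["prev"] = later["urn"], earlier["urn"]
    return found


for piece in fragments(ET.fromstring(SAMPLE)):
    print(piece["urn"], "->", piece["text"])
```

Each fragment carries both its document coordinates and its entity coordinates, and the "prev" and "next" fields play the role of the TEI linking attributes described above.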

One might say: it is as simple as that.  Except that implementing this has taken some very complex programming.  Hats off to Xiaohan Zhang for figuring out how to do this.
