31 Mar 2016

TEI format inherited from tipitaka.org

Differences between tipitaka.de and tipitaka.org TEI format

The main idea of TEI (Text Encoding Initiative) is to represent text semantically. The XML files from tipitaka.org use the TEI format, though it is not clear whether they follow TEI's semantic idea. Converting the tipitaka.org files to HTML is no problem, however, because the rendering information translates almost one to one into HTML.

Footnotes

Converting to an XML format is as straightforward as producing the HTML files. But once the embedded footnotes need to be parsed and put into a structure, nothing is straightforward anymore.

Why bother about the footnotes?

They contain tons of information: alternative readings for some words or phrases, collected from various sources of the Tipitaka. See the tipitaka.org help on footnotes:
For example, Atthi me attā’ti vā assa [vāssa (sī. syā. pī.)] can be read as Atthi me attā’ti vāssa in the Thai (syā.), Sinhalese (sī.) or Pali Text Society (pī.) version of the Tipitaka.
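As a rough illustration of what parsing such a footnote involves, here is a minimal Python sketch. It assumes the footnote always has the shape "[variant (abbr. abbr. ...)]" as in the example above; the function name `parse_footnote` and the regular expression are made up for this illustration and are not taken from the actual converter.

```python
import re

# Assumed footnote shape: "[variant (abbr. abbr. ...)]".
# Real footnotes may be messier than this pattern allows.
FOOTNOTE_RE = re.compile(
    r"\[(?P<variant>[^\[\]()]+?)\s*\((?P<sources>[^()]*)\)\]"
)


def parse_footnote(text):
    """Return (variant, [source abbreviations]) for the first footnote, or None."""
    match = FOOTNOTE_RE.search(text)
    if match is None:
        return None
    variant = match.group("variant").strip()
    # Source abbreviations are separated by whitespace inside the parentheses.
    sources = match.group("sources").split()
    return variant, sources


line = "Atthi me attā’ti vā assa [vāssa (sī. syā. pī.)]"
print(parse_footnote(line))  # ('vāssa', ['sī.', 'syā.', 'pī.'])
```

A pattern this simple will not cover every footnote in the corpus, which is part of why the parsing stops being straightforward.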

While parsing the footnotes, plenty of other source abbreviations turned up besides the three from the tipitaka.org help pages:
a.
aṭṭha.
cūḷani.
cūḷava.
dī.
itivu.
jā.
kaṃ.
khu.
ma.
mahāni.
mahāva.
moga.
mū.
pa.
pe.
pu.
pāci.
pārā.
rū.
saṃ.
su.
syā.§ka.
theragā.
udā.
vi.
visuddhi.
vā.
ṭī.
Asking their support for further explanations did not yield any response. This means that some information, namely what these abbreviations stand for, is already lost.

XML format of the footnotes

<alternatives>
  <alternative source-abbr="vri" source="vipassana research institut">bhagavā</alternative>
  <alternative source-abbr="syā." source="thai">bhagavāti</alternative>
</alternatives>
This snippet sits inside the text and replaces the word bhagavā. It is a totally different structure, as it changes the semantics from "what to print where" to "what does this mean".
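For illustration, a structure like this can be built with Python's standard `xml.etree.ElementTree` module. This is only a sketch: the helper name `build_alternatives` and its tuple-based input are assumptions for this example, and only the element and attribute names come from the snippet above.

```python
import xml.etree.ElementTree as ET


def build_alternatives(readings):
    """Build an <alternatives> element.

    `readings` is a list of (source_abbr, source, text) tuples,
    a made-up input shape for this sketch.
    """
    root = ET.Element("alternatives")
    for source_abbr, source, text in readings:
        alt = ET.SubElement(
            root,
            "alternative",
            {"source-abbr": source_abbr, "source": source},
        )
        alt.text = text
    return root


element = build_alternatives([
    ("vri", "vipassana research institut", "bhagavā"),
    ("syā.", "thai", "bhagavāti"),
])
print(ET.tostring(element, encoding="unicode"))
```

The output is a single unindented line, but it contains the same elements and attributes as the snippet above.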

Problems converting the XML back to its original TEI

Sometimes a word is followed by a PAGE-BREAK tag and then by a NOTE tag. In TEI the PAGE-BREAK separates the NOTE from its word, but in the XML format the NOTE and its word or phrase form ONE tag, followed by the PAGE-BREAK. The XML format feels right: the word or phrase and its footnote belong together, and the page break always comes after.

Converting this back does not reproduce the original TEI: NOTE and PAGE-BREAK tags are swapped whenever they follow each other directly.
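The effect of the swap can be sketched on a flat list of tokens. This is a simplified Python model, not the converter's code: the `(kind, value)` tuples, the kind names `"text"`, `"pb"` and `"note"`, and the page number are all invented for the illustration.

```python
def swap_page_break_and_note(tokens):
    """Move a note in front of a page break that directly precedes it.

    `tokens` is a list of (kind, value) tuples; kinds are assumed to be
    "text", "pb" (page break) and "note".
    """
    result = list(tokens)
    i = 0
    while i < len(result) - 1:
        if result[i][0] == "pb" and result[i + 1][0] == "note":
            # The note belongs to the preceding word, so it goes first.
            result[i], result[i + 1] = result[i + 1], result[i]
            i += 2
        else:
            i += 1
    return result


original = [
    ("text", "vā assa"),
    ("pb", "page 2"),  # made-up page marker
    ("note", "vāssa (sī. syā. pī.)"),
]
print(swap_page_break_and_note(original))
```

After the swap the note directly follows its word and the page break comes last, which is the order the back-converted TEI ends up with.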

There are also some problems with whitespace and punctuation here and there.

Testing the round trip conversion

The conversion from the original TEI to XML and back to TEI is tested. Apart from the problems above, the test checks whether the round trip reproduces the original file. See the directories that are already part of the test:

tests on the github repository
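The idea of such a round-trip test can be sketched in Python as follows. The converter functions are passed in as parameters because the real ones are not shown here; `round_trip_diff` and the stand-in converters in the usage example are hypothetical and only demonstrate the shape of the check.

```python
import difflib


def round_trip_diff(original, to_xml, to_tei):
    """Convert TEI text to XML and back, and return the diff lines.

    An empty list means the round trip reproduced the original text.
    `to_xml` and `to_tei` are whatever converter callables are in use.
    """
    restored = to_tei(to_xml(original))
    return list(
        difflib.unified_diff(
            original.splitlines(),
            restored.splitlines(),
            fromfile="original",
            tofile="round-trip",
            lineterm="",
        )
    )


# Stand-in converters for demonstration only: reversing the text twice
# is lossless, while stripping whitespace on the way back is lossy.
def reverse(text):
    return text[::-1]


def lossy_reverse(text):
    return text[::-1].strip()


sample = "line one\nline two \n"
print(round_trip_diff(sample, reverse, reverse))        # [] -> identical
print(bool(round_trip_diff(sample, reverse, lossy_reverse)))  # True -> differs
```

Returning the diff rather than a plain boolean makes it easier to see where whitespace or punctuation went missing.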

