31 Mar 2016

TEI format inherited from tipitaka.org

Differences between tipitaka.de and tipitaka.org TEI format

The main idea of TEI (Text Encoding Initiative) is to represent text semantically. The XML files from tipitaka.org use the TEI format, though I am not sure they follow the semantic idea of TEI. Converting the tipitaka.org files to HTML is no problem, because almost all of the rendering information translates one to one to HTML.


Converting to an XML format is as straightforward as producing the HTML files. But once the embedded footnotes need to be parsed and put into a structure, nothing is straightforward anymore.

Why bother with the footnotes?

They contain tons of information: alternative readings for words or phrases, collected from various sources of the Tipitaka. See the tipitaka.org help on footnotes:
for example, Atthi me attā’ti vā assa [vāssa (sī. syā. pī.)] can be read as Atthi me attā’ti vāssa in the Sinhalese (sī.), Thai (syā.) or Pali Text Society (pī.) version of the Tipitaka.
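To make the footnote shape concrete, here is a minimal Python sketch that splits one bracketed note into its variant text and its source abbreviations. The regular expression and the name parse_note are assumptions made for illustration; they only cover the simple "[text (abbr. abbr.)]" shape from the example above, not the many other note forms in the real files.

```python
import re

# Hypothetical pattern for the simple shape "[variant (abbr. abbr. ...)]".
NOTE_RE = re.compile(r"\[(?P<text>[^\[\]()]+?)\s*\((?P<sources>[^()]*)\)\]")


def parse_note(line):
    """Return (variant_text, [source abbreviations]) for the first note, or None."""
    match = NOTE_RE.search(line)
    if match is None:
        return None
    # The abbreviations are separated by whitespace inside the parentheses.
    return match.group("text").strip(), match.group("sources").split()


line = "Atthi me attā’ti vā assa [vāssa (sī. syā. pī.)]"
print(parse_note(line))
```

Note that the sketch does not say which words before the bracket the variant replaces; that matching is the hard part.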

While parsing the footnotes, plenty of other sources turned up besides the three from the tipitaka.org help pages.
Asking their support for further explanations did not yield any response. This means some information is already lost.

XML format of the footnotes

  <alternative source-abbr="vri" source="vipassana research institut">bhagavā</alternative>
  <alternative source-abbr="syā." source="thai">bhagavāti</alternative>
This snippet sits inside the text in place of the word bhagavā. It is a totally different structure, as it changes the semantics from "what to print where" to "what does this mean".
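As a rough illustration of that structure, the following Python sketch builds the two alternative elements with the standard library. The wrapping alternatives element and the function name build_alternatives are assumptions for the example; the attribute names are the ones shown in the snippet above.

```python
import xml.etree.ElementTree as ET


def build_alternatives(readings):
    """Build an <alternatives> element from (abbr, source, text) tuples."""
    parent = ET.Element("alternatives")  # assumed wrapper element
    for abbr, source, text in readings:
        child = ET.SubElement(
            parent, "alternative", {"source-abbr": abbr, "source": source}
        )
        child.text = text
    return parent


element = build_alternatives([
    ("vri", "vipassana research institut", "bhagavā"),
    ("syā.", "thai", "bhagavāti"),
])
print(ET.tostring(element, encoding="unicode"))
```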

Problems converting the XML back to its original TEI

Sometimes a word is followed by a PAGE-BREAK tag and then by a NOTE tag. In TEI the PAGE-BREAK separates the NOTE from its word, but in the XML format the NOTE and its word or phrase form ONE tag, followed by the PAGE-BREAK. The XML format feels right: the note belongs to its word or phrase, and the page break always comes after.

Converting this back does not reproduce the original TEI: the NOTE and PAGE-BREAK tags are swapped wherever they follow each other directly.
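The effect of the swap can be sketched in Python on a simplified token stream. The (kind, value) tuples, the kind names "pb" and "note", and the page number are all made up for illustration; the real converter works on XML, not on such a list.

```python
def swap_note_and_page_break(tokens):
    """Move a note in front of a page break that directly precedes it."""
    result = list(tokens)
    i = 0
    while i < len(result) - 1:
        if result[i][0] == "pb" and result[i + 1][0] == "note":
            result[i], result[i + 1] = result[i + 1], result[i]
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return result


# Order as found in the original TEI: word, page break, note.
tei_order = [("text", "bhagavā"), ("pb", "0002"), ("note", "bhagavāti (syā.)")]
print(swap_note_and_page_break(tei_order))
```

Applying the same normalisation to the original file before comparing is one way a round-trip test could tolerate this known difference.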

There are also some problems with whitespace and punctuation here and there.

Testing the round trip conversion

The conversion from the original TEI to XML and back to TEI is tested. Apart from the problems above, the test checks whether the round trip reproduces the original file. See the directories which are already part of the test:

tests on the github repository
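The idea of the round-trip check can be sketched in Python as follows. The two converters are passed in as parameters; fake_to_xml and fake_to_tei are stand-ins invented for this example and have nothing to do with the real conversion code.

```python
import difflib


def round_trip_diff(original, to_xml, to_tei):
    """Convert to XML and back, return the unified diff (empty if identical)."""
    restored = to_tei(to_xml(original))
    return list(difflib.unified_diff(
        original.splitlines(), restored.splitlines(),
        fromfile="original", tofile="round-trip", lineterm="",
    ))


# Stand-in converters, only here to make the sketch runnable.
def fake_to_xml(text):
    return text.replace("<note>", "<alternative>").replace("</note>", "</alternative>")


def fake_to_tei(text):
    return text.replace("<alternative>", "<note>").replace("</alternative>", "</note>")


sample = "<p>bhagavā <note>bhagavāti (syā.)</note></p>"
diff = round_trip_diff(sample, fake_to_xml, fake_to_tei)
print("identical" if not diff else "\n".join(diff))
```

An empty diff means the round trip reproduced the file; anything else shows exactly where whitespace, punctuation or tag order drifted.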

13 Mar 2016

first step for an archive for tipitaka.de

There are several parts to an archive for www.tipitaka.de:
  1. mirror the tipitaka.org data to a github repository so all changes can be tracked
  2. convert the mirror data to a more technical xml format and store the files in a hierarchical directory structure
  3. collect the information about notes inside the original text from tipitaka.org and convert them to be used by the new xml format

the tipitaka.org mirror

https://github.com/tipitaka-org/tipitaka-mirror is the base for the XML format. These files have several problems:
  • they are mostly in UTF-16 encoding, which does not work well with the standard Linux and macOS text-processing tools on the console: grep, sed, cat and the like all assume the system-wide default of UTF-8. Of course I could find out how to switch them to UTF-16, but having these tools work out of the box is much nicer.
  • opening a file gives no context. The file name tells a little, but you need to decipher its abbreviations.
  • the current format is TEI, which is used here to define how things get rendered. The translation to HTML is straightforward, but the format carries little additional meta-information. For example, there are notes like 'bhagavāti (syā.)' annotated as a note; this one is an alternative text for 'bhagavā' used in the Thai version of the Tipitaka (syā. indicates the Thai version). BUT the format does not allow me to produce the text of the Thai version of the Tipitaka, as it does not tell me which word(s) I need to replace. A note may also be a textual reference to another part of the Tipitaka (also heavily abbreviated).
These files are the source for the converted data in https://github.com/tipitaka-org/tipitaka-archive, which is used to serve www.tipitaka.de
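The encoding problem from the first point can be worked around by re-encoding the files. A minimal Python sketch, assuming the file carries a byte-order mark (which Python's "utf-16" codec uses to pick the byte order); the function name and the file names are hypothetical, and the demo writes its own sample file instead of touching the mirror.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(source, target):
    """Re-encode one UTF-16 file as UTF-8 and return the decoded text."""
    # The "utf-16" codec reads the byte-order mark to pick the byte order.
    text = Path(source).read_text(encoding="utf-16")
    Path(target).write_text(text, encoding="utf-8")
    return text


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "sample-utf16.xml"  # hypothetical file names
    target = Path(tmp) / "sample-utf8.xml"
    source.write_text("<p>bhagavā</p>\n", encoding="utf-16")
    convert_to_utf8(source, target)
    converted = target.read_bytes()
    print(converted.decode("utf-8"))
```

If a file starts with an XML declaration that names UTF-16 as its encoding, that declaration would need to be updated as well; the sketch leaves it alone.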

the xml data

https://github.com/tipitaka-org/tipitaka-archive/blob/master/archive/data/ will have the converted files from the tipitaka.org mirror. the xml format is still a work in progress and will change again.

one thing to point out is the alternative text section

<alternatives line="3">
  <alternative lang="vri">paṭissato</alternative>
  <alternative lang="sī.">patissato</alternative>
</alternatives>

which allows the consumer of the file to switch between the VRI version of the Tipitaka and the Sinhalese version from Sri Lanka. To produce those alternatives derived from the original sources an intermediate step is needed - the notes files.
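How a consumer might switch between versions can be sketched in Python. The function pick_reading and the fallback to the VRI reading are assumptions for illustration; the element and attribute names come from the snippet above, and the format itself may still change.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<alternatives line="3">
  <alternative lang="vri">paṭissato</alternative>
  <alternative lang="sī.">patissato</alternative>
</alternatives>"""


def pick_reading(alternatives, lang, default="vri"):
    """Return the reading for lang, falling back to the default version."""
    by_lang = {
        alt.get("lang"): alt.text
        for alt in alternatives.findall("alternative")
    }
    return by_lang.get(lang, by_lang.get(default))


node = ET.fromstring(SNIPPET)
print(pick_reading(node, "sī."))   # Sinhalese reading
print(pick_reading(node, "syā."))  # no Thai reading here, falls back to VRI
```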

the notes files

The notes files are at https://github.com/tipitaka-org/tipitaka-archive/tree/master/archive/notes/roman. They contain enough information to verify them as well as to produce links to the github sources:
here line 3 is also part of the notes.xml, and the file itself contains its archive path.

The matching is done automatically, but manual matching is also possible. For example, if someone finds a wrong match, a link to the github repo can be provided with a request to propose a change.

Currently there are over 22000 notes and the tools produce
  • no matching alternative: 2483
  • automatic matching alternatives: 15613
  • raw notes with different information: 4249
  • manual matched alternatives: 1
How good the automatic matching is, is hard to say at the moment; the next step is verification of the generated data.
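For a quick sense of proportion, the counts above can be tallied in Python. This assumes the four categories do not overlap, which the numbers alone do not confirm.

```python
# Counts as reported above; the categories are assumed to be disjoint.
counts = {
    "no matching alternative": 2483,
    "automatic matching alternatives": 15613,
    "raw notes with different information": 4249,
    "manual matched alternatives": 1,
}

total = sum(counts.values())
print(f"total notes: {total}")
for label, count in counts.items():
    print(f"{label}: {count} ({count / total:.1%})")
```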


Verification is still missing, but the intent is to use the XML format to reproduce the original TEI format and verify that the files are identical. This is still pending, and there are probably loads of errors and bugs in the current version of the files.