13 Mar 2016

first step for an archive for tipitaka.de

the archive for www.tipitaka.de consists of several parts:
  1. mirror the tipitaka.org data to a github repository so all changes can be tracked
  2. convert the mirror data to a more technical xml format and store the files in a hierarchical directory structure
  3. collect the information about notes inside the original text from tipitaka.org and convert them so they can be used by the new xml format

the tipitaka.org mirror

https://github.com/tipitaka-org/tipitaka-mirror is the base for the XML format. these files have several problems:
  • they are mostly in UTF-16 encoding, which does not work well with the standard Linux and MacOS text processing tools in your console/terminal. grep, sed, cat, etc. do not work, as they all use the system-wide default UTF-8. of course I could try to find out how to switch them to UTF-16, but having these tools work out of the box is much nicer.
  • when opening a file there is no context. the file name does tell a little bit, but you need to decipher abbreviations.
  • the current format is the TEI format, which is used to define how things get rendered. the translation to html is straightforward, but the format carries little additional meta-information. for example, there are notes like 'bhagavāti (syā.)' annotated as note. this example is an alternative text for 'bhagavā' used in the Thai version of the tipitaka; syā indicates the Thai version. BUT the format does not allow me to produce the text of the Thai version of the Tipitaka, as it does not tell me which word(s) I need to replace. a note may also be a textual reference to another part of the Tipitaka (also heavily abbreviated).
These files are the source from which the converted data for https://github.com/tipitaka-org/tipitaka-archive is produced, and that data is used to serve www.tipitaka.de
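
The encoding problem from the first point can be worked around by re-encoding a file before using console tools on it. The following is only a minimal sketch, not the conversion the archive actually uses: the function name and file paths are made up, and it assumes the file starts with a byte order mark and may carry an XML declaration naming UTF-16.

```python
from pathlib import Path


def utf16_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode one UTF-16 file as UTF-8 (hypothetical helper)."""
    # The "utf-16" codec uses the byte order mark to pick the endianness
    # and drops the mark from the decoded text.
    text = src.read_bytes().decode("utf-16")
    # Assumption: if there is an XML declaration, it names UTF-16 exactly
    # like this. Update it so XML parsers do not reject the new file.
    text = text.replace('encoding="UTF-16"', 'encoding="UTF-8"', 1)
    # Writing bytes keeps the original line endings untouched.
    dst.write_bytes(text.encode("utf-8"))
```

After such a conversion grep, sed and cat work on the output file without any extra switches.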

the xml data

https://github.com/tipitaka-org/tipitaka-archive/blob/master/archive/data/ will have the converted files from the tipitaka.org mirror. the xml format is still a work in progress and will change again.

one thing to point out is the alternative text section

<alternatives line="3">
  <alternative lang="vri">paṭissato</alternative>
  <alternative lang="sī.">patissato</alternative>
</alternatives>

which allows the consumer of the file to switch between the VRI version of the Tipitaka and the Sinhalese version from Sri Lanka. To derive those alternatives from the original sources, an intermediate step is needed: the notes files.
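
How a consumer might switch between versions can be sketched in a few lines of Python. This is only an illustration based on the snippet above: the function name is made up, and falling back to the VRI reading when a version has no alternative is an assumption, not something the format (still a work in progress) prescribes.

```python
import xml.etree.ElementTree as ET

# The alternatives element from above, closed so that it parses on its own.
SNIPPET = """
<alternatives line="3">
  <alternative lang="vri">paṭissato</alternative>
  <alternative lang="sī.">patissato</alternative>
</alternatives>
"""


def pick_alternative(alternatives: ET.Element, lang: str, fallback: str = "vri") -> str:
    """Return the reading for `lang`, or the fallback reading if there is none."""
    by_lang = {
        alt.get("lang"): (alt.text or "")
        for alt in alternatives.findall("alternative")
    }
    # Assumption: every alternatives element carries the fallback version.
    return by_lang.get(lang, by_lang[fallback])


node = ET.fromstring(SNIPPET.strip())
print(pick_alternative(node, "sī."))   # Sinhalese reading
print(pick_alternative(node, "syā."))  # no Thai reading here, so the VRI one
```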

the notes files

https://github.com/tipitaka-org/tipitaka-archive/tree/master/archive/notes/roman holds the notes files. they contain enough information to verify them as well as to produce links to the github sources:
the line number (3 in the alternatives example above) is also part of the notes.xml, and the file itself contains its archive path.

The matching is done automatically, but manual matching is also possible. For example, if someone finds a wrong match, a link to the github repo can be provided with the request to propose a change.
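
To give an idea of what automatic matching means, here is a small sketch. It is not the algorithm the archive tools use: it simply splits a note like 'bhagavāti (syā.)' into reading and source abbreviation and then looks for the most similar word in the line with Python's difflib. The function names, the similarity cutoff and the example line are all made up for illustration.

```python
import re
from difflib import get_close_matches

# A note is assumed to look like "<reading> (<source abbreviation>)".
NOTE_RE = re.compile(r"^(?P<reading>.+?)\s*\((?P<source>[^()]+)\)$")


def parse_note(note):
    """Split a note into (reading, source), or return None if it does not fit."""
    m = NOTE_RE.match(note.strip())
    if m is None:
        return None
    return m.group("reading"), m.group("source")


def match_note(line, note, cutoff=0.6):
    """Guess which word of `line` the note is an alternative for."""
    parsed = parse_note(note)
    if parsed is None:
        return None
    reading, source = parsed
    # Simplification: words are split on whitespace, punctuation is ignored.
    candidates = get_close_matches(reading, line.split(), n=1, cutoff=cutoff)
    if not candidates:
        return None  # would be counted as "no matching alternative"
    return {"original": candidates[0], "alternative": reading, "source": source}


print(match_note("ekaṃ samayaṃ bhagavā viharati", "bhagavāti (syā.)"))
```

Notes that do not fit the simple pattern, for example textual references to other parts of the Tipitaka, return None here and would need separate handling.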

Currently there are over 22000 notes and the tools produce:
  • no matching alternative: 2483
  • automatic matching alternatives: 15613
  • raw notes with different information: 4249
  • manually matched alternatives: 1
How good the automatic matching is, is hard to say at the moment, but the next step is verification of the generated data.
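
For a rough feeling of the proportions, the counts above can be added up and turned into shares. This is plain arithmetic on the numbers listed above, nothing more:

```python
counts = {
    "no matching alternative": 2483,
    "automatic matching alternatives": 15613,
    "raw notes with different information": 4249,
    "manually matched alternatives": 1,
}

total = sum(counts.values())
# Share of each category in percent of all notes.
shares = {name: 100 * n / total for name, n in counts.items()}

print(f"total notes: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```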


Verification is still missing, but the intent is to use the xml format to reproduce the original TEI format and verify that the files are identical. This is still pending, and there are probably loads of errors and bugs in the current version of the files.
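
Since the verification does not exist yet, the following is only a sketch of how the final comparison step could look, assuming the TEI files have already been regenerated into a second directory with the same layout as the mirror. The function name and the directory layout are assumptions.

```python
import filecmp
from pathlib import Path


def find_mismatches(original_dir: Path, regenerated_dir: Path) -> list[str]:
    """List the files whose regenerated copy is missing or differs."""
    mismatches = []
    for original in sorted(original_dir.rglob("*.xml")):
        relative = original.relative_to(original_dir)
        regenerated = regenerated_dir / relative
        # shallow=False compares the file contents, not just size and mtime.
        if not regenerated.is_file() or not filecmp.cmp(
            original, regenerated, shallow=False
        ):
            mismatches.append(relative.as_posix())
    return mismatches
```

An empty list would mean that every file survived the round trip unchanged.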

