- mirror the tipitaka.org data to a GitHub repository so all changes can be tracked
- convert the mirror data to a more technical XML format and store the files in a hierarchical directory structure
- collect the information about notes inside the original text from tipitaka.org and convert it for use in the new XML format
the tipitaka.org mirror (https://github.com/tipitaka-org/tipitaka-mirror) is the base for the XML format. these files have several problems:
- they are mostly in UTF-16 encoding, which does not work well with the standard Linux and macOS text processing tools for your console/terminal. grep, sed, cat and so on all assume the system-wide default, UTF-8. of course I could try to find out how to switch them to UTF-16, but having these tools work out of the box is much nicer.
- when opening a file there is no context. the file name tells you a little, but you need to decipher its abbreviations.
- the current format is TEI, which is used here to define how things get rendered. the translation to HTML is straightforward, but the format carries little additional meta-information. for example, there are notes like 'bhagavāti (syā.)' annotated as note. this one is an alternative text for 'bhagavā' used in the Thai version of the Tipitaka (syā. indicates the Thai version). BUT the format does not allow me to produce the text of the Thai version of the Tipitaka, as it does not tell me which word(s) I need to replace. a note may also be a textual reference to another part of the Tipitaka (also heavily abbreviated).
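for the UTF-16 problem in the first point, a one-off conversion is enough to make grep and friends usable. below is a minimal Python sketch, not part of the project's tools. `utf16_to_utf8` is a hypothetical helper, and it assumes the files start with a byte order mark and may name UTF-16 in their XML declaration:

```python
from pathlib import Path


def utf16_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode one UTF-16 file as UTF-8 (hypothetical helper)."""
    # The "utf-16" codec reads the byte order mark to pick the endianness
    # and drops the mark from the decoded text.
    text = src.read_text(encoding="utf-16")
    # Assumption: if an XML declaration names UTF-16, it should now
    # name UTF-8, otherwise XML parsers may reject the converted file.
    text = text.replace('encoding="UTF-16"', 'encoding="UTF-8"', 1)
    dst.write_text(text, encoding="utf-8")
```

calling `utf16_to_utf8(Path("in.xml"), Path("out.xml"))` (both file names are placeholders) leaves a file that the usual console tools can read directly.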
the XML data (https://github.com/tipitaka-org/tipitaka-archive/blob/master/archive/data/) will hold the converted files from the tipitaka.org mirror. the XML format is still a work in progress and will change again.
one thing to point out is the alternative text section:
<alternatives line="3">
  <alternative lang="vri">paṭissato</alternative>
  <alternative lang="sī.">patissato</alternative>
</alternatives>
which allows the consumer of the file to switch between the VRI version of the Tipitaka and the Sinhalese version from Sri Lanka. Producing those alternatives from the original sources needs an intermediate step: the notes files.
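As an illustration of how a consumer might switch versions, here is a minimal Python sketch that reads the `alternatives` element shown above and picks the text for one `lang` value. The function name `pick_alternative` and the fallback to `vri` are assumptions made for this example, and since the XML format is still changing, the element layout may differ later:

```python
import xml.etree.ElementTree as ET

# The element from the example above, parsed on its own.
SNIPPET = """<alternatives line="3">
  <alternative lang="vri">paṭissato</alternative>
  <alternative lang="sī.">patissato</alternative>
</alternatives>"""


def pick_alternative(alternatives: ET.Element, lang: str, fallback: str = "vri") -> str:
    """Return the text for `lang`, or the fallback version if it is absent."""
    by_lang = {
        alt.get("lang"): (alt.text or "")
        for alt in alternatives.findall("alternative")
    }
    # Assumption: every alternatives element carries a "vri" entry.
    return by_lang.get(lang, by_lang[fallback])


element = ET.fromstring(SNIPPET)
sinhalese = pick_alternative(element, "sī.")  # text of the Sinhalese version
vri = pick_alternative(element, "vri")        # text of the VRI version
```

Asking for a version that has no entry here, for example `syā.`, returns the VRI text under the fallback assumption.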
the notes files (https://github.com/tipitaka-org/tipitaka-archive/tree/master/archive/notes/roman) contain enough information to verify them as well as to produce links to the GitHub sources.
the line number (line 3 in the alternatives example above) is also part of the notes.xml, and the file itself contains its archive path.
The matching is done automatically, but manual matching is also possible. For example, if someone finds a wrong match, a link to the GitHub repo can be provided together with a request to propose a change.
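To give an idea of what automatic matching involves, here is a rough Python sketch. It is not the project's actual algorithm: it splits a note such as 'bhagavāti (syā.)' into the variant text and the source abbreviation, then looks for the most similar word in a line using `difflib` from the standard library. The names `parse_note` and `closest_word`, the similarity cutoff, and the sample line are all assumptions made for illustration:

```python
import difflib
import re
import unicodedata
from typing import Optional, Tuple

# Variant text followed by a source abbreviation in parentheses.
NOTE_PATTERN = re.compile(r"^(?P<text>.+?)\s*\((?P<source>[^)]+)\)\s*$")


def parse_note(note: str) -> Tuple[str, str]:
    """Split 'bhagavāti (syā.)' into ('bhagavāti', 'syā.')."""
    match = NOTE_PATTERN.match(unicodedata.normalize("NFC", note))
    if match is None:
        raise ValueError(f"not a variant note: {note!r}")
    return match.group("text"), match.group("source")


def closest_word(variant: str, line: str, cutoff: float = 0.6) -> Optional[str]:
    """Return the word in `line` most similar to `variant`, or None."""
    # Normalize so precomposed and decomposed diacritics compare equal.
    words = unicodedata.normalize("NFC", line).split()
    variant = unicodedata.normalize("NFC", variant)
    matches = difflib.get_close_matches(variant, words, n=1, cutoff=cutoff)
    return matches[0] if matches else None


variant, source = parse_note("bhagavāti (syā.)")
# Illustrative line only, not taken from the archive files.
candidate = closest_word(variant, "evaṃ me sutaṃ ekaṃ samayaṃ bhagavā")
```

A similarity search like this can pick the wrong word when a line contains several close candidates, which is the kind of case where manual matching is useful.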
Currently there are over 22000 notes, and the tools produce:
- no matching alternative: 2483
- automatic matching alternatives: 15613
- raw notes with different information: 4249
- manually matched alternatives: 1
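The four categories can be summed to check the overall figure and to see each category's share. A small Python sketch using only the numbers listed above:

```python
# Counts copied from the list above.
counts = {
    "no matching alternative": 2483,
    "automatic matching alternatives": 15613,
    "raw notes with different information": 4249,
    "manual matched alternatives": 1,
}

total = sum(counts.values())  # 22346 notes in total
shares = {label: count / total for label, count in counts.items()}

for label, share in shares.items():
    # Print each category as a percentage of all notes.
    print(f"{label}: {share:.1%}")
```

The total comes to 22346, with roughly 70 percent of the notes matched automatically.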