dirkroorda committed Dec 23, 2021
1 parent 90161ff commit 976a2c8
Showing 32 changed files with 59,698 additions and 529 deletions.
71 changes: 69 additions & 2 deletions README.md
@@ -1,2 +1,69 @@
# dhammapada
Text of the Dhammapadi (Pali language) with Latin translation
# Dhammapada latine

[![SWH](https://archive.softwareheritage.org/badge/origin/https://github.com/ETCBC/bhsa/)](https://archive.softwareheritage.org/browse/origin/https://github.com/ETCBC/bhsa/)
[![DOI](https://zenodo.org/badge/104559294.svg)](https://zenodo.org/badge/latestdoi/104559294)
[![Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.](https://www.repostatus.org/badges/latest/wip.svg)](https://www.repostatus.org/#wip)

[![etcbc](programs/images/etcbc.png)](http://www.etcbc.nl)
![logo](programs/images/logo.png)
[![dans](programs/images/dans.png)](https://dans.knaw.nl/en)
[![tf](programs/images/tf-small.png)](https://annotation.github.io/text-fabric/tf)


## About

This is the
[text-fabric](https://github.com/Dans-labs/text-fabric/wiki)
representation of the Dhammapada in the edition with Latin translation by V. Fausböll, 1900.

See [about](docs/about.md) for more information about this textual source.

The conversion to Text-Fabric is joint work of

* [prof. dr. Bee Scherer](https://research.vu.nl/en/persons/bee-scherer),
Text and Traditions,
VU-University Amsterdam;
* [prof. dr. Willem van Peursen](https://research.vu.nl/en/persons/willem-van-peursen),
[ETCBC](http://www.etcbc.nl),
VU-University Amsterdam;
* Yvonne Mataar,
transcription and correction
* [dr. Dirk Roorda](https://pure.knaw.nl/portal/en/persons/dirk-roorda),
[DANS](https://www.dans.knaw.nl),
conversion to the Text-Fabric-sphere.

There is more information on the
[transcription](https://github.com/etcbc/blob/master/docs/transcription.md).

## How to use

This data can be processed by
[Text-Fabric](https://annotation.github.io/text-fabric/tf).

Text-Fabric will automatically download the Dhammapada data.

After installing Text-Fabric, you can start the Text-Fabric browser with this command:

```sh
text-fabric dhammapada
```

Alternatively, you can work in a Jupyter notebook and run:

```python
from tf.app import use

A = use('dhammapada')
```

In both cases the data is downloaded and ends up in your home directory,
under `text-fabric-data`.

See also
[start](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/dhammapada/start.ipynb)
and
[search](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/dhammapada/search.ipynb).

# Author

[Dirk Roorda](https://github.com/dirkroorda)
66 changes: 66 additions & 0 deletions docs/about.md
@@ -0,0 +1,66 @@
# About the Dhammapada

The Dhammapada is a collection of sayings of the Buddha in verse form
and one of the most widely read and best known Buddhist scriptures.
The first written source dates from 300 BCE.

It is written in the Pāli language, which is a close relative of Sanskrit.

## Additional resources

* [wikipedia Dhammapada](https://en.wikipedia.org/wiki/Dhammapada)
* [wikipedia Pāli](https://en.wikipedia.org/wiki/Pali), ISO-codes `pli`, `pi`
* [Pāli-English with comments, by stanza](https://www.tipitaka.net/tipitaka/dhp/)
* [Interlinear Pāli-English (single pdf)](https://www.ancient-buddhist-texts.net/Texts-and-Translations/Dhammapada/Dhammapada.pdf)
* [English translation online](http://www.buddhanet.net/e-learning/buddhism/dhamma.htm)
* [English translation (single PDF)](http://www.buddhanet.net/pdf_file/scrndhamma.pdf)

# About this text

The text that is the source of this dataset rests on the work of
[Viggo Fausböll](https://en.wikipedia.org/wiki/Viggo_Fausböll) who translated the
Dhammapada into Latin in 1855.


field | value
--- | ---
title | `The Dhammapada`
subtitle | `being a collection of moral verses in Pāli`
remark | `edited a second time with a literal latin translation and notes for the use of Pāli students`
editor | `V. Fausboll`
publisher | `Luzac & Co., Publishers to the India Office`
publisher address | `46, Great Russell Street, W.C. London`
published | `1900`

The cover page of the edition that is the source of this dataset:

![cover](images/cover.png)

# The text

The Dhammapada is divided into *vaggas*, which are divided into *stanzas*.
There are 26 vaggas and 423 stanzas, which are numbered consecutively throughout the whole
work.

The book uses the Latin script for the Pāli text.

As an example, here are the first 7 stanzas of the first vagga in Pāli:

![pali7](images/pali7.png)

and here the same stanzas in Latin:

![latin7](images/latin7.png)

## Additional resources

* Fausbøll, Michael Viggo, The Dhammapada. Being a collection of moral verses in Pali.
Edited a second time with a literal Latin translation and notes
for the use of Pali students.
[free fragment of an article by Burkhard Scherer (pdf)](https://link.springer.com/article/10.1023/A:1012252226747)

# The conversion

The conversion program is in [tfFromTxt.py](programs/tfFromTxt.py).
It can be seen in action in a Jupyter notebook:
[convert.ipynb](https://nbviewer.org/github/etcbc/dhammapada/blob/master/programs/convert.ipynb).
Binary file added docs/images/cover.png
Binary file added docs/images/latin7.png
Binary file added docs/images/logo.png
Binary file added docs/images/pali7.png
195 changes: 195 additions & 0 deletions docs/transcription.md
@@ -0,0 +1,195 @@
<img src="images/logo.png" align="right" width="200"/>
<img src="images/tf.png" align="right" width="200"/>

# Feature documentation

Here you find a description of the transcription of the Dhammapada,
of the
[Text-Fabric model](https://annotation.github.io/text-fabric/tf/about/datamodel.html)
in general, and of the node types and features of the
Dhammapada corpus in particular.

See also

* [about](about.md) for the provenance of the data;
* [TF docs](https://annotation.github.io/text-fabric/tf) for documentation on Text-Fabric.

## Transcription

The corpus consists of a text in Pāli and a Latin translation of the text.
The main subdivision is in 26 units named *vaggas*, which are themselves divided
into stanzas. There are 423 stanzas in the whole work and they are numbered across the vaggas
from 1 to 423. These numbers are coded in the feature `n`.

The original text and its translation are linked stanza-wise.

During conversion we have made a finer division into clauses and sentences.
Sentences are terminated by `.` and `?`; clauses are terminated by `;`, `:`, and
also by `-` when it is not attached to a word.
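The termination rules above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `tfFromTxt.py`: the function name `split_units` is hypothetical, the sample string is made up, and we assume that the end of a sentence also closes the current clause and that closing quotes or brackets may follow the terminating sign.

```python
SENTENCE_END = {".", "?"}
CLAUSE_END = {";", ":"}


def split_units(text):
    """Split text into sentences; each sentence is a list of clauses,
    each clause a list of whitespace-separated tokens."""
    sentences, clauses, clause = [], [], []
    for token in text.split():
        clause.append(token)
        # Assumption: closing quotes/brackets may follow the terminating sign.
        last = token.rstrip('")]')[-1:]
        ends_sentence = last in SENTENCE_END
        # A free-standing "-" ends a clause; a hyphen inside a word does not.
        ends_clause = ends_sentence or last in CLAUSE_END or token == "-"
        if ends_clause:
            clauses.append(clause)
            clause = []
        if ends_sentence:
            sentences.append(clauses)
            clauses = []
    # Flush material that is not followed by a terminator.
    if clause:
        clauses.append(clause)
    if clauses:
        sentences.append(clauses)
    return sentences


# Made-up sample, not corpus text.
result = split_units("alpha beta; gamma - delta. epsilon zeta?")
```

Here `result` holds two sentences: the first has three clauses (ended by `;`, by the free-standing `-`, and by `.`), the second has one.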

Clauses are subdivided into words, and words consist of
non-letters before, letters, and non-letters after.
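As a rough sketch of this three-part structure, a regular expression can split a token into its non-letters before, its letters, and its non-letters after. The helper name `split_words` is hypothetical and the pattern is an assumption; the actual conversion may treat some characters differently (for example combining diacritics, which `\w` does not match).

```python
import re

# Non-letters before, letters, non-letters after.
WORD_RE = re.compile(r"(\W*)(\w+)(\W*)")


def split_words(token):
    """Return a list of (pre, letters, post) triples found in a token."""
    return WORD_RE.findall(token)


simple = split_words("[manasā,")
compound = split_words("(qui-)que")
```

With this pattern a string such as `(qui-)que` yields two triples, one for `qui` and one for `que`.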

Sentences and clauses sometimes cross stanza boundaries, but never
vagga boundaries.
That is why we number sentences and clauses by their sequence number within their
vaggas, again in feature `n`.

Most words are separated by spaces, but we also make word divisions in strings like
`(qui-)que`.

In the Latin text we encounter `( )`: this is material added for clarity by the author
of the translation, Fausbøll. We code it in the feature `clarity`, see below.

In the Pāli text we also encounter `[ ]`: this is material that is not completely certain.
We code it in the feature `uncertain`, see below.

In both texts there is quoted material. We normalize the quotes to the ASCII double quote
`"`, and we mark words in a quotation by means of the feature `quote`.

There is (very little) material outside stanzas: one case of interstanza material,
and several cases at the start and end of vaggas.
We mark this material with the feature `extrastanza`.
The stanza number for extra stanza material is the stanza number of the nearest stanza in the
same vagga, increased by 1000. So a 4-digit stanza number is by definition not a real stanza.
And a 3-digit stanza is always a real stanza.
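The numbering convention can be captured in two small helpers. This is only a sketch of the rule as stated above; the names `is_real_stanza` and `nearest_stanza` are hypothetical and not part of the dataset or of Text-Fabric.

```python
EXTRA_OFFSET = 1000  # added to the nearest stanza number for extra-stanza material


def is_real_stanza(n):
    """True if n is the number of a real stanza (1..423)."""
    return n < EXTRA_OFFSET


def nearest_stanza(n):
    """Map an extra-stanza number back to the stanza it is attached to."""
    return n - EXTRA_OFFSET if n >= EXTRA_OFFSET else n
```

For example, a value of `1020` in feature `n` would denote extra-stanza material near stanza 20.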

Sentences, clauses and words either belong to the Pāli original or to the Latin
translation. The feature `trans` codes which is the case.

!!! caution "Mind the twins"
The fact that stanzas contain both the original and the translation has these consequences:

* If you count the words inside a stanza, you add up the Pāli words and the
Latin words. Likewise if you count sentences and clauses.
* If you want to count only words, clauses, sentences of one text type,
use the `trans` feature to distinguish between them.
* If you count the words *within* sentences or clauses, you count the words of
one text type only.
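The counting pitfall can be illustrated with toy data. The list below is a hypothetical stand-in for the word nodes of one stanza, not corpus data; on the real corpus the analogous check would use `F.trans.v(w)`, which yields `1` for Latin words and `None` for Pāli words.

```python
from collections import Counter

# Toy stand-in for the words of one stanza (hypothetical data).
# `trans` is 1 for Latin words and absent (None) for Pāli words.
stanza_words = [
    {"text": "p1", "trans": None},
    {"text": "p2", "trans": None},
    {"text": "p3", "trans": None},
    {"text": "l1", "trans": 1},
    {"text": "l2", "trans": 1},
]


def count_by_text_type(words):
    """Count words per text type, using the `trans` value to tell them apart."""
    counts = Counter()
    for word in words:
        kind = "latin" if word.get("trans") == 1 else "pali"
        counts[kind] += 1
    return counts


counts = count_by_text_type(stanza_words)
total = sum(counts.values())  # the naive count mixes both text types
```

The naive `total` adds up both text types, while `counts` keeps the Pāli and Latin words apart.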


## Text-Fabric model

The Text-Fabric model views the text as a series of atomic units, called
*slots*. In this corpus [*words*](#word) are the slots.

On top of that, more complex textual objects can be represented as *nodes*. In
this corpus we have node types for:

[*word*](#word),
[*clause*](#clause),
[*sentence*](#sentence),
[*stanza*](#stanza),
[*vagga*](#vagga).

The type of every node is given by the feature
[**otype**](https://annotation.github.io/text-fabric/tf/cheatsheet.html#f-node-features).
Every node is linked to a subset of slots by
[**oslots**](https://annotation.github.io/text-fabric/tf/cheatsheet.html#special-edge-feature-oslots).

Nodes can be annotated with features.
Relations between nodes can be annotated with edge features.
See the table below.

Text-Fabric supports up to three customizable section levels.
In this corpus we use only two:
[*vagga*](#vagga) and [*stanza*](#stanza).

# Reference table of features

*(Keep this under your pillow)*

## *absent*

When we say that a feature is *absent* for a node, we mean that the node has no value
for the feature. For example, if the feature `trans` is absent for node `n`, then
`F.trans.v(n)` results in the Python value `None`, not the string `'None'`.

In queries, you can test for absence by means of `#`:

```
word trans#
```

gives all words where the feature `trans` is absent (these are all the Pāli words).

See also
[search templates](https://annotation.github.io/text-fabric/tf/about/searchusage.html)
under **Value specifications**.

## Node type [*word*](#word)

Basic unit containing a word plus attached non-word stuff such as punctuation,
or a text-critical sign like `( ) [ ]`.

feature | values | description
------- | ------ | ------
**pali** | `manasā` | the real word letters of a Pāli word
**latin** | `mente` | the real word letters of a Latin word
**palipre** | `[` | immediately preceding non-word characters of a Pāli word
**latinpre** | `[` | immediately preceding non-word characters of a Latin word
**palipost** | `[` | non-word characters after a Pāli word, including whitespace
**latinpost** | `[` | non-word characters after a Latin word, including whitespace
**extrastanza** | `1` | indicates the word is outside a stanza
**quote** | `1` | indicates the word is inside a quotation
**uncertain** | `1` | **Pāli only**: indicates the word is uncertain (somewhere inside a `[ ]` pair)
**clarity** | `1` | **Latin only**: indicates the word is added for clarity (somewhere inside a `( )` pair)
**trans** | `1` | indicates the word belongs to the Latin translation

## Node type [*clause*](#clause)

Subdivision of a containing [*sentence*](#sentence).

feature | values | description
------- | ------ | ------
**n** | `1` `2` | sequence number of a clause within its vagga
**trans** | `1` | indicates the clause belongs to the Latin translation

## Node type [*sentence*](#sentence)

Subdivision of a containing [*vagga*](#vagga).

feature | values | description
------- | ------ | ------
**n** | `1` `2` | sequence number of a sentence within its vagga
**trans** | `1` | indicates the sentence belongs to the Latin translation

## Node type [*stanza*](#stanza)

Section level 2.

Subdivision of a containing [*vagga*](#vagga).

feature | values | description
------- | ------ | ------
**n** | `1` `2` | sequence number of a stanza within the whole work

## Node type [*vagga*](#vagga)

Section level 1.

Subdivision of the whole work.

feature | values | description
------- | ------ | ------
**n** | `1` `2` | sequence number of a vagga within the whole work

# Text formats

The following text formats are defined (you can also list them with `T.formats`).

format | description
--- | ---
`text-orig-full` | prints the text of all words, Pāli and Latin
`text-pali-full` | prints the text of all Pāli words and leaves Latin words empty
`text-latin-full` | prints the text of all Latin words and leaves Pāli words empty
`layout-orig-full` | as `text-orig-full` but with special layout for quote, uncertain, clarity, etc.
`layout-pali-full` | as `text-pali-full` but with special layout for quote, uncertain, clarity, etc.
`layout-latin-full` | as `text-latin-full` but with special layout for quote, uncertain, clarity, etc.

The formats with `text` result in strings that are plain text, without additional formatting.

The formats with `layout` result in pieces of HTML with CSS styles;
the richness of layout enables us to code more information
in the plain representation, e.g. blurry characters when words are uncertain.
We also use different colours for Pāli and Latin.