towards completion
dirkroorda committed Dec 24, 2021
1 parent dd9af05 commit 9ffefaf
Showing 22 changed files with 14,740 additions and 2,818 deletions.
6 changes: 3 additions & 3 deletions README.md
Expand Up @@ -13,7 +13,7 @@
## About

This is the
[text-fabric](https://github.com/Dans-labs/text-fabric/wiki)
[text-fabric](https://github.com/annotation/text-fabric)
representation of the Dhammapada in the edition with Latin translation by V. Fausböll, 1900.

See [about](docs/about.md) for more information about this textual source.
Expand All @@ -33,14 +33,14 @@ The conversion to Text-Fabric is joint work of
conversion to the Text-Fabric-sphere.

There is more information on the
[transcription](docs/transcription.md)
[transcription](docs/transcription.md).

## How to use

This data can be processed by
[Text-Fabric](https://annotation.github.io/text-fabric/tf).

Text-Fabric will automatically download the BHSA data.
Text-Fabric will automatically download the corpus data.

After installing Text-Fabric, you can start the Text-Fabric browser by this command

Expand Down
21 changes: 16 additions & 5 deletions docs/about.md
Expand Up @@ -20,6 +20,8 @@ It is written in the Pāli language, which is a close relative of Sanskrit.
The text that is the source of this dataset rests on the work of
[Viggo Fausböll](https://en.wikipedia.org/wiki/Viggo_Fausböll) who translated the
Dhammapada into Latin in 1855.
We took the text from the 1900 edition of his book in which he included the
original Pāli text in Latin script.


field | value
Expand All @@ -32,7 +34,7 @@ publisher | `Luzac & Co., Publishers to the India Office`
publisher address | `46, Great Russell Street, W.C. London`
published | `1900`

The cover page of the edition that is the source of this dataset:
The cover page of that edition is:

![cover](images/cover.png)

Expand All @@ -42,15 +44,13 @@ The Dhammapada is divided in *vaggas* which are divided in *stanzas*.
There are 26 vaggas and 423 stanzas, which are numbered consecutively throughout the whole
work.

The book uses the latin script for the Pāli text.

As an example, here are the first 7 stanzas of the first vagga in Pāli:

![pali7](images/pali7.png)

and here the same stanzas in Latin:

![pali7](images/pali7.png)
![latin7](images/latin7.png)

## Additional resources

Expand All @@ -61,6 +61,17 @@ and here the same stanzas in Latin:

# The conversion

The conversion program in in [tfFromTxt.py](programs/tfFromTxt.py).
The conversion program is in [tfFromTxt.py](../programs/tfFromTxt.py).
It can be seen in action in a Jupyter notebook:
[convert.ipynb](https://nbviewer.org/github/etcbc/dhammapada/blob/master/programs/convert.ipynb)

# Progress

* **2021-12-24** First version. The text-fabric features correspond to the plain texts and
the obvious structure in vaggas, stanzas, sentences, and clauses.
No attempts to add linguistic features have been made so far.

# Future work

We can use the current dataset to generate workflows to annotate the texts
with linguistic features, such as lemma, part-of-speech, etc.
757 changes: 567 additions & 190 deletions programs/convert.ipynb

Large diffs are not rendered by default.

53 changes: 42 additions & 11 deletions programs/tfFromTxt.py
Expand Up @@ -7,7 +7,7 @@
from tf.convert.walker import CV


VERSION = "0.1"
VERSION = "0.2"
SLOT_TYPE = "word"
GENERIC = dict(
language="pli,lat",
Expand All @@ -26,6 +26,7 @@
yearPublished="1900",
copynote1="Digitisation supported by Shri Brihad Bhartiya Samaj 20 February 2020",
stamp="50480",
version=VERSION,
)
OTEXT = {
"fmt:text-orig-full": "{palipre/latinpre}{pali/latin}{palipost/latinpost}",
Expand All @@ -34,7 +35,15 @@
"sectionTypes": "vagga,stanza",
"sectionFeatures": "n,n",
}
INT_FEATURES = {"n", "trans", "extrastanza", "quote", "uncertain", "clarity"}
INT_FEATURES = {
"n",
"trans",
"extrastanza",
"quote",
"uncertain",
"clarity",
"freq_occ",
}

FEATURE_META = dict(
n=dict(
Expand Down Expand Up @@ -75,8 +84,7 @@
format="1 (=true) or absent (=false)",
),
quote=dict(
description="word is inside a quote",
format="1 (=true) or absent (=false)"
description="word is inside a quote", format="1 (=true) or absent (=false)"
),
uncertain=dict(
description=(
Expand All @@ -96,6 +104,10 @@
description="whether the node belongs to the original text or a translation",
format="1 (=Latin translation) or absent (=Pali original)",
),
freq_occ=dict(
description="the number of times that this word occurs",
format="positive integer",
),
)


Expand Down Expand Up @@ -410,7 +422,9 @@ def tokenize(self, src=None):
words.append([preBracket, False])
words.append([bracketed + postBracket, False])
else:
words.append([preBracket + bracketed + postBracket, False])
words.append(
[preBracket + bracketed + postBracket, False]
)
break
else:
if realPre:
Expand Down Expand Up @@ -441,6 +455,8 @@ def tokenize(self, src=None):
if word == "-":
tokens[-1][-1] += " - "
affixes[POST][tokens[-1][-1]] += 1
cur["sentence"] += 1
cur["clause"] += 1
continue

if word == "":
Expand Down Expand Up @@ -502,11 +518,12 @@ def tokenize(self, src=None):
elif "”" in postWord:
cur["quote"] = False
postWord = postWord.replace("”", '"')

tokens[-1][-1] = postWord
if (
"," in postWord
or ";" in postWord
or ":" in postWord
or "-" in postWord
):
cur["clause"] += 1
if "." in postWord or "?" in postWord:
Expand Down Expand Up @@ -692,6 +709,9 @@ def makeTf(self):
chunks = self.chunks
cv = CV(Fabric(locations=TF_DIR))

freqOcc = collections.Counter()
wordIndex = {}

def director(cv):
SENTENCE = "sentence"
CLAUSE = "clause"
Expand Down Expand Up @@ -781,6 +801,8 @@ def director(cv):
)
)
cv.feature(wordNode, **wordFeatures)
freqOcc[word] += 1
wordIndex[wordNode] = word

myLast[CLAUSE] = clause

Expand All @@ -800,6 +822,9 @@ def director(cv):
cv.terminate(n)
cv.terminate(vaggaNode)

for (n, word) in wordIndex.items():
cv.feature(n, freq_occ=freqOcc[word])

return cv.walk(
director,
SLOT_TYPE,
Expand All @@ -815,11 +840,17 @@ def loadTf(self):
allFeatures = TF.explore(silent=True, show=True)
loadableFeatures = allFeatures["nodes"] + allFeatures["edges"]
api = TF.load(loadableFeatures, silent=False)
if api:
print(f"max node = {api.F.otype.maxNode}")
print("Frequencies of words")
for (word, n) in api.F.pali.freqList()[0:20]:
print(f"{n:>6} x {word}")
if not api:
return False

F = api.F

print(f"max node = {api.F.otype.maxNode}")
print("Frequencies of words")
for (word, n) in F.pali.freqList()[0:20]:
print(f"{n:>6} x {word}")
for w in F.otype.s("word")[0:50]:
print(f"{F.freq_occ.v(w):>4} x {F.pali.v(w) or F.latin.v(w)}")

def interLinking(self):
pass
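
Review note on the `freq_occ` change: the diff collects counts in `freqOcc` while walking the words and assigns the feature only after the loop, presumably because a word's total is not known until every word has been seen. Below is a minimal standalone sketch of that two-pass pattern in plain Python, without Text-Fabric. The helper `assign_frequencies` and the sample data are hypothetical and not part of the repository.

```python
from collections import Counter


def assign_frequencies(words_by_node):
    """Return a mapping node -> number of times that node's word occurs.

    Pass 1 counts every word; pass 2 looks up the final count per node.
    `words_by_node` is a hypothetical stand-in for the diff's `wordIndex`.
    """
    # Pass 1: tally all word forms across the whole collection.
    freq = Counter(words_by_node.values())
    # Pass 2: now that the totals are final, attach them to each node.
    return {node: freq[word] for node, word in words_by_node.items()}


# Toy data: node numbers mapped to word forms (illustrative only).
sample = {1: "yo", 2: "ca", 3: "vassasataṃ", 4: "jīve", 5: "yo", 6: "ca"}
freq_occ = assign_frequencies(sample)
```

Assigning the count inside the first loop would give each word only its running total, which is why the lookup is deferred to a second pass.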
45 changes: 22 additions & 23 deletions sources/pali.txt
Expand Up @@ -214,7 +214,7 @@ yo ca sīlavataṃ gandho vāti devesu uttamo.
sammadaññāvimuttānaṃ Māro maggaṃ na vindati.

58 yathā saṃkāradhānasmiṃ ujjhitasmiṃ mahāpathe
padumaṃ tattha jāyetha sucigandhaṃ manoramaṃ,
padumaṃ tattha jāyetha sucigandhaṃ manoramaṃ;

59 evaṃ saṃkārabhūtesu andhabhūte puthujjane
atirocati paññāya sammāsambuddhasāvako.
Expand Down Expand Up @@ -412,19 +412,19 @@ cattāro dhammā vaḍḍhanti: āyu vaṇṇo sukhaṃ balaṃ.
110 yo ca vassasataṃ jīve dussīlo asamāhito -
ekāhaṃ jīvitaṃ seyyo sīlavantassa jhāyino.

111 yo ca vassasataṃ jīve duppañño asamāhito
111 yo ca vassasataṃ jīve duppañño asamāhito -
ekāhaṃ jīvitaṃ seyyo paññāvantassa jhāyino.

112 yo ca vassasataṃ jīve kusīto hīnavīriyo
ekāhaṃ jīvitaṃ seyyo vīryam ārabhato daḷhaṃ.

113 yo ca vassasataṃ jīve apassaṃ udayavyayaṃ
113 yo ca vassasataṃ jīve apassaṃ udayavyayaṃ -
ekāhaṃ jīvitaṃ seyyo passato udayavyayaṃ.

114 yo ca vassasataṃ jīve apassaṃ amataṃ padaṃ
114 yo ca vassasataṃ jīve apassaṃ amataṃ padaṃ -
ekāhaṃ jīvitaṃ seyyo passato amataṃ padaṃ.

115 yo ca vassasataṃ jīve apassaṃ dhammam uttamaṃ
115 yo ca vassasataṃ jīve apassaṃ dhammam uttamaṃ -
ekāhaṃ jīvitaṃ seyyo passato dhammaṃ uttamam.

Sahassavaggo aṭṭhamo
Expand Down Expand Up @@ -509,7 +509,7 @@ evaṃ jarā ca maccu ca āyuṃ pācenti pāṇinaṃ.
sehi kammehi dummedho aggidaḍḍho va tappati.

137 yo daṇḍena adaṇḍesu appaduṭṭhesu dussati
dasann' aññataraṃ ṭhānaṃ khippam eva nigacchati.
dasann' aññataraṃ ṭhānaṃ khippam eva nigacchati:

138 vedanaṃ pharusaṃ jāniṃ sarīrassa ca bhedanaṃ
garukaṃ vâpi ābādhaṃ cittakkhepaṃ va pāpuṇe.
Expand Down Expand Up @@ -565,7 +565,7 @@ kāpotakāni aṭṭhīni tāni disvāna kā rati.
yattha jarā ca maccu ca māno makkho ca ohito.

151 jīranti ve rājarathā sucittā
atho sarīram pi jaraṃ upeti
atho sarīram pi jaraṃ upeti.
satañ ca dhammo na jaraṃ upeti
santo have sabbhi pavedayanti.

Expand All @@ -575,7 +575,6 @@ maṃsāni tassa vaḍḍhanti paññā tassa na vaḍḍhati.
153 anekajātisaṃsāraṃ sandhāvissaṃ anibbisaṃ
gahakārakaṃ gavesanto, dukkhā jāti punappunaṃ.


154 gahakāraka diṭṭḥo si puna gehaṃ na kāhasi,
sabbā te phāsukā bhaggā gahakūṭaṃ visaṃkhitaṃ,
visaṃkhāragataṃ cittaṃ taṇhānaṃ khayam ajjhagā.
Expand All @@ -599,7 +598,7 @@ ath' aññam anusāseyya, na kilisseyya paṇḍito.
159 attānañ ce tathā kayrā yath' aññam anusāsati
sudanto vata dametha, attā hi kira duddamo.

160 attā hi attano nātho, ko hi nātho paro siyā;
160 attā hi attano nātho, ko hi nātho paro siyā,
attanā hi sudantena nāthaṃ labhati dullabhaṃ.

161 attanā va kataṃ pāpaṃ
Expand Down Expand Up @@ -672,7 +671,7 @@ Lokavaggo terasamo

14. Buddhavagga
179 yassa jitaṃ nâvajīyati
jitaṃ assa no yāti koci loke,
jitaṃ assa no yāti koci loke
tam buddham anantagocaraṃ
apadaṃ kena padena nessatha.

Expand Down Expand Up @@ -702,7 +701,7 @@ mattaññutā ca bhattasmiṃ pantañ ca sayanāsanaṃ
adhicitte ca āyogo etaṃ Buddhāna sāsanaṃ.

186 na kahāpaṇavassena titti kāmesu vijjati,
"appassādā dukhā kāmā" iti viññāya paṇḍito.
"appassādā dukhā kāmā" iti viññāya paṇḍito

187 Api dibbesu kāmesu ratiṃ so nâdhigacchati,
taṇhakkhayarato hoti sammāsambuddhasāvako.
Expand All @@ -723,7 +722,7 @@ sammapaññāya passati
191 dukkhaṃ dukkhasamuppādaṃ
dukkhassa ca atikkamaṃ
ariyañ c' aṭṭhaṅgikaṃ maggaṃ
dukkhūpasamagāminaṃ.
dukkhūpasamagāminaṃ

192 etaṃ kho saraṇaṃ khemaṃ etaṃ saraṇam uttamaṃ
etaṃ saraṇaṃ āgamma sabbadukkhā pamuccati.
Expand All @@ -735,7 +734,7 @@ yattha so jāyatī dhīro taṃ kulaṃ sukham edhati.
sukhā saṃghassa sāmaggī samaggānaṃ tapo sukho.

195 pūjārahe pūjayato Buddhe yadiva sāvake
papañcasamatikkante tiṇṇasokapariddave,
papañcasamatikkante tiṇṇasokapariddave

196 te tādise pūjayato nibbute akutobhaye
na sakkā puññaṃ saṃkhātuṃ im' ettam api kenaci.
Expand Down Expand Up @@ -864,7 +863,7 @@ mitabhāṇinam pi nindanti, n' atthi loke anindito.
ekantaṃ nindito poso ekantaṃ vā pasaṃsito.

229 yañ ce viññū pasaṃsanti anuvicca suve suve
acchiddavuttiṃ medhāviṃ paññāsīlasamāhitaṃ.
acchiddavuttiṃ medhāviṃ paññāsīlasamāhitaṃ

230 nekkhaṃ jambonadassêva ko taṃ ninditum arhati,
devâpi naṃ pasaṃsanti, Brahmunâpi pasaṃsito.
Expand Down Expand Up @@ -1119,7 +1118,7 @@ yesaṃ divā ca ratto ca niccaṃ Saṃgha-gatā sati.
299 suppabuddhaṃ pabujjhanti sadā Gotamasāvakā
yesaṃ divā ca ratto ca niccaṃ kāyagatā sati.

300 suppabuddhaṃ pabujjhanti sadā Gotamasāvakā
300 suppabuddhaṃ pabujjhanti sadā Gotamasāvakā.
yesaṃ divā ca ratto ca ahiṃsāya rato mano.

301 suppabuddhaṃ pabujjhanti sadā Gotamasāvakā
Expand Down Expand Up @@ -1277,10 +1276,10 @@ mā vo naḷaṃ va soto va Māro bhañji punappunaṃ.
338 yathāpi mūle anupaddave daḷhe
chinno pi rukkho punar eva rūhati
evam pi taṇhānusaye anūhate
nibbattati dukkham idaṃ punappunaṃ.
nibbattati dukkham idaṃ punappunaṃ

339 yassa chattiṃsatī sotā manāpassavanā bhusā,
vāhā vahanti duddiṭṭhaṃ saṃkappā rāganissitā.
339 yassa chattiṃsatī sotā manāpassavanā bhusā
vāhā vahanti duddiṭṭhaṃ saṃkappā rāganissitā

340 savanti sabbadā sotā latā ubbhijja tiṭṭhati
tañ ca disvā lataṃ jātaṃ mūlaṃ paññāya chindatha.
Expand Down Expand Up @@ -1374,7 +1373,7 @@ Taṇhāvaggo catuvīsatimo
ghāṇena saṃvaro sādhu, sādhu jivhāya saṃvaro.

361 kāyena saṃvaro sādhu, sādhu vācāya saṃvaro,
manasā saṃvaro sādhu, sādhu sabbattha saṃvaro
manasā saṃvaro sādhu, sādhu sabbattha saṃvaro,
sabbattha saṃvuto bhikkhu sabbadukkhā pamuccati.

362 hatthasaññato pādasaññato
Expand Down Expand Up @@ -1464,16 +1463,16 @@ vītaddaraṃ visaññuttaṃ tam ahaṃ brūmi brāhmaṇaṃ.
386 jhāyiṃ virajam āsīnaṃ katakiccaṃ anāsavaṃ
uttamatthaṃ anuppattaṃ tam ahaṃ brūmi brāhmaṇaṃ.

387 divā tapati ādicco, rattiṃ ābhāti candimā,
sannaddho khattyo tapati, jhāyī tapati brāhmaṇo,
387 divā tapati ādicco, rattiṃ ābhāti candimā.
sannaddho khattyo tapati, jhāyī tapati brāhmaṇo.
atha sabbam ahorattiṃ Buddho tapati tejasā.

388 bāhitapāpo ti brāhmaṇo
samacaryā samaṇo ti vuccati.
pabbājayam attano malaṃ
tasmā pabbajito ti vuccati.

389 na brāhmaṇassa hareyya
389 na brāhmaṇassa hareyya;
nâssa muñcetha brāhmaṇo,
dhī brāhmaṇassa hantāraṃ,
tato dhī y' assa muñcati.
Expand All @@ -1496,7 +1495,7 @@ na jaccā hoti brāhmaṇo,
yamhi saccañ ca dhammo ca
so sukhī so ca brāhmaṇo.

394 kin te jaṭāhi dummedha, kin te ajinasāṭiyā,
394 kin te jaṭāhi dummedha, kin te ajinasāṭiyā.
abbhantaran te gahanaṃ, bāhiraṃ parimajjasi.

395 paṃsukūladharaṃ jantuṃ
Expand Down
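
Review note: the punctuation edits to `sources/pali.txt` interact with the tokenizer change in `programs/tfFromTxt.py`, where a `-` in the material trailing a word now advances the clause counter alongside `,`, `;` and `:`. Below is a minimal sketch of that check, assuming simplified input (a list of trailing-punctuation strings). `count_clause_breaks` is a hypothetical helper, and the handling of `.` and `?` is not modeled because the diff does not show it.

```python
# Punctuation marks that close a clause, as listed in the diff's condition.
CLAUSE_MARKS = (",", ";", ":", "-")


def count_clause_breaks(post_words):
    """Count the trailing-punctuation strings that contain a clause mark."""
    return sum(
        1 for post in post_words if any(mark in post for mark in CLAUSE_MARKS)
    )


# Toy input: trailing material after five consecutive words.
posts = [" ", ", ", " - ", ". ", "; "]
n_breaks = count_clause_breaks(posts)
```

Under this rule, adding or removing a single comma, semicolon or dash in the source text shifts every later clause number within the stanza, so the text changes and the tokenizer change are best reviewed together.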