towards completion

ETCBC · Dec 24, 2021 · 9ffefaf
1 parent dd9af05

Showing 22 changed files with 14,740 additions and 2,818 deletions.
diff --git a/README.md b/README.md
@@ -13,7 +13,7 @@
 ## About
 
 This is the
-[text-fabric](https://github.com/Dans-labs/text-fabric/wiki)
+[text-fabric](https://github.com/annotation/text-fabric)
 representation of the Dhammapada in the edition with Latin translation by V. Fausböll, 1900.
 
 See [about](docs/about.md) for more information about this textual source.
@@ -33,14 +33,14 @@ The conversion to Text-Fabric is joint work of
     conversion to the Text-Fabric-sphere.
 
 There is more information on the
-[transcription](docs/transcription.md)
+[transcription](docs/transcription.md).
 
 ## How to use
 
 This data can be processed by 
 [Text-Fabric](https://annotation.github.io/text-fabric/tf).
 
-Text-Fabric will automatically download the BHSA data.
+Text-Fabric will automatically download the corpus data.
 
 After installing Text-Fabric, you can start the Text-Fabric browser by this command
 

diff --git a/docs/about.md b/docs/about.md
@@ -20,6 +20,8 @@ It is written in the Pāli language, which is a close relative of Sanskrit.
 The text that is the source of this dataset rests on the work of
 [Viggo Fausböll](https://en.wikipedia.org/wiki/Viggo_Fausböll) who translated the
 Dhammapada into Latin in 1855.
+We took the text from the 1900 edition of his book in which he included the
+original Pāli text in Latin script.
 
 
 field | value
@@ -32,7 +34,7 @@ publisher | `Luzac & Co., Publishers to the India Office`
 publisher address | `46, Great Russell Street, W.C. London`
 published | `1900`
 
-The cover page of the edition that is the source of this dataset:
+The cover page of that edition is:
 
 ![cover](images/cover.png)
 
@@ -42,15 +44,13 @@ The Dhammapada is divided in *vaggas* which are divided in *stanzas*.
 There are 26 vaggas and 423 stanzas, which are numbered consecutively throughout the whole
 work.
 
-The book uses the latin script for the Pāli text.
-
 As an example, here are the first 7 stanzas of the first vagga in Pāli:
 
 ![pali7](images/pali7.png)
 
 and here the same stanzas in Latin:
 
-![pali7](images/pali7.png)
+![latin7](images/latin7.png)
 
 ## Additional resources
 
@@ -61,6 +61,17 @@ and here the same stanzas in Latin:
 
 # The conversion
 
-The conversion program in in [tfFromTxt.py](programs/tfFromTxt.py).
+The conversion program is in [tfFromTxt.py](../programs/tfFromTxt.py).
 It can be seen in action in a Jupyter notebook: 
 [convert.ipynb](https://nbviewer.org/github/etcbc/dhammapada/blob/master/programs/convert.ipynb)
+
+# Progress
+
+*   **2021-12-24** First version. The text-fabric features correspond to the plain texts and
+    the obvious structure in vaggas, stanzas, sentences, and clauses.
+    No attempts to add linguistic features have been made so far.
+
+# Future work
+
+We can use the current dataset to generate workflows to annotate the texts
+with linguistic features, such as lemma, part-of-speech, etc.
diff --git a/programs/convert.ipynb b/programs/convert.ipynb
diff --git a/programs/tfFromTxt.py b/programs/tfFromTxt.py
@@ -7,7 +7,7 @@
 from tf.convert.walker import CV
 
 
-VERSION = "0.1"
+VERSION = "0.2"
 SLOT_TYPE = "word"
 GENERIC = dict(
     language="pli,lat",
@@ -26,6 +26,7 @@
     yearPublished="1900",
     copynote1="Digitisation supported by Shri Brihad Bhartiya Samaj 20 February 2020",
     stamp="50480",
+    version=VERSION,
 )
 OTEXT = {
     "fmt:text-orig-full": "{palipre/latinpre}{pali/latin}{palipost/latinpost}",
@@ -34,7 +35,15 @@
     "sectionTypes": "vagga,stanza",
     "sectionFeatures": "n,n",
 }
-INT_FEATURES = {"n", "trans", "extrastanza", "quote", "uncertain", "clarity"}
+INT_FEATURES = {
+    "n",
+    "trans",
+    "extrastanza",
+    "quote",
+    "uncertain",
+    "clarity",
+    "freq_occ",
+}
 
 FEATURE_META = dict(
     n=dict(
@@ -75,8 +84,7 @@
         format="1 (=true) or absent (=false)",
     ),
     quote=dict(
-        description="word is inside a quote",
-        format="1 (=true) or absent (=false)"
+        description="word is inside a quote", format="1 (=true) or absent (=false)"
     ),
     uncertain=dict(
         description=(
@@ -96,6 +104,10 @@
         description="whether the node belongs to the original text or a translation",
         format="1 (=Latin translation) or absent (=Pali original)",
     ),
+    freq_occ=dict(
+        description="the number of times that this word occurs",
+        format="positive integer",
+    ),
 )
 
 
@@ -410,7 +422,9 @@ def tokenize(self, src=None):
                                     words.append([preBracket, False])
                                     words.append([bracketed + postBracket, False])
                                 else:
-                                    words.append([preBracket + bracketed + postBracket, False])
+                                    words.append(
+                                        [preBracket + bracketed + postBracket, False]
+                                    )
                                 break
                             else:
                                 if realPre:
@@ -441,6 +455,8 @@ def tokenize(self, src=None):
                     if word == "-":
                         tokens[-1][-1] += " - "
                         affixes[POST][tokens[-1][-1]] += 1
+                        cur["sentence"] += 1
+                        cur["clause"] += 1
                         continue
 
                     if word == "":
@@ -502,11 +518,12 @@ def tokenize(self, src=None):
                     elif "”" in postWord:
                         cur["quote"] = False
                         postWord = postWord.replace("”", '"')
+
+                    tokens[-1][-1] = postWord
                     if (
                         "," in postWord
                         or ";" in postWord
                         or ":" in postWord
-                        or "-" in postWord
                     ):
                         cur["clause"] += 1
                     if "." in postWord or "?" in postWord:
@@ -692,6 +709,9 @@ def makeTf(self):
         chunks = self.chunks
         cv = CV(Fabric(locations=TF_DIR))
 
+        freqOcc = collections.Counter()
+        wordIndex = {}
+
         def director(cv):
             SENTENCE = "sentence"
             CLAUSE = "clause"
@@ -781,6 +801,8 @@ def director(cv):
                                         )
                                     )
                                     cv.feature(wordNode, **wordFeatures)
+                                    freqOcc[word] += 1
+                                    wordIndex[wordNode] = word
 
                                 myLast[CLAUSE] = clause
 
@@ -800,6 +822,9 @@ def director(cv):
                             cv.terminate(n)
                 cv.terminate(vaggaNode)
 
+            for (n, word) in wordIndex.items():
+                cv.feature(n, freq_occ=freqOcc[word])
+
         return cv.walk(
             director,
             SLOT_TYPE,
@@ -815,11 +840,17 @@ def loadTf(self):
         allFeatures = TF.explore(silent=True, show=True)
         loadableFeatures = allFeatures["nodes"] + allFeatures["edges"]
         api = TF.load(loadableFeatures, silent=False)
-        if api:
-            print(f"max node = {api.F.otype.maxNode}")
-            print("Frequencies of words")
-            for (word, n) in api.F.pali.freqList()[0:20]:
-                print(f"{n:>6} x {word}")
+        if not api:
+            return False
+
+        F = api.F
+
+        print(f"max node = {api.F.otype.maxNode}")
+        print("Frequencies of words")
+        for (word, n) in F.pali.freqList()[0:20]:
+            print(f"{n:>6} x {word}")
+        for w in F.otype.s("word")[0:50]:
+            print(f"{F.freq_occ.v(w):>4} x {F.pali.v(w) or F.latin.v(w)}")
 
     def interLinking(self):
         pass
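
The `freq_occ` change in `makeTf` above counts word forms while the nodes are being created and assigns the feature only after the walk, because a word's total is not known until every occurrence has been seen. A minimal standalone sketch of that two-pass idea, without Text-Fabric; the helper `attach_frequencies` and the sample node numbers are hypothetical and not part of the repository:

```python
from collections import Counter


def attach_frequencies(words_by_node):
    """Map each node to the total number of occurrences of its word form.

    words_by_node: dict of node number -> word string (illustrative input,
    standing in for the wordIndex dict built during the walk).
    """
    # Pass 1: count every word form across all nodes.
    freq = Counter(words_by_node.values())
    # Pass 2: give each node the total count of its own word form.
    return {node: freq[word] for node, word in words_by_node.items()}


# Hypothetical word nodes; the words are taken from stanzas 110-111.
sample = {1: "yo", 2: "ca", 3: "vassasataṃ", 4: "jīve", 5: "yo", 6: "ca"}
freq_occ = attach_frequencies(sample)
```

In this sketch both nodes holding `yo` receive the same value, which mirrors how the commit stores one frequency per word node rather than per distinct word form.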
diff --git a/sources/pali.txt b/sources/pali.txt
@@ -214,7 +214,7 @@ yo ca sīlavataṃ gandho vāti devesu uttamo.
 sammadaññāvimuttānaṃ Māro maggaṃ na vindati.
 
 58 yathā saṃkāradhānasmiṃ ujjhitasmiṃ mahāpathe
-padumaṃ tattha jāyetha sucigandhaṃ manoramaṃ,
+padumaṃ tattha jāyetha sucigandhaṃ manoramaṃ;
 
 59 evaṃ saṃkārabhūtesu andhabhūte puthujjane
 atirocati paññāya sammāsambuddhasāvako.
@@ -412,19 +412,19 @@ cattāro dhammā vaḍḍhanti: āyu vaṇṇo sukhaṃ balaṃ.
 110 yo ca vassasataṃ jīve dussīlo asamāhito -
 ekāhaṃ jīvitaṃ seyyo sīlavantassa jhāyino. 
 
-111 yo ca vassasataṃ jīve duppañño asamāhito 
+111 yo ca vassasataṃ jīve duppañño asamāhito -
 ekāhaṃ jīvitaṃ seyyo paññāvantassa jhāyino. 
 
 112 yo ca vassasataṃ jīve kusīto hīnavīriyo 
 ekāhaṃ jīvitaṃ seyyo vīryam ārabhato daḷhaṃ. 
 
-113 yo ca vassasataṃ jīve apassaṃ udayavyayaṃ 
+113 yo ca vassasataṃ jīve apassaṃ udayavyayaṃ -
 ekāhaṃ jīvitaṃ seyyo passato udayavyayaṃ. 
 
-114 yo ca vassasataṃ jīve apassaṃ amataṃ padaṃ 
+114 yo ca vassasataṃ jīve apassaṃ amataṃ padaṃ -
 ekāhaṃ jīvitaṃ seyyo passato amataṃ padaṃ. 
 
-115 yo ca vassasataṃ jīve apassaṃ dhammam uttamaṃ 
+115 yo ca vassasataṃ jīve apassaṃ dhammam uttamaṃ -
 ekāhaṃ jīvitaṃ seyyo passato dhammaṃ uttamam. 
 
 Sahassavaggo aṭṭhamo
@@ -509,7 +509,7 @@ evaṃ jarā ca maccu ca āyuṃ pācenti pāṇinaṃ.
 sehi kammehi dummedho aggidaḍḍho va tappati. 
 
 137 yo daṇḍena adaṇḍesu appaduṭṭhesu dussati 
-dasann' aññataraṃ ṭhānaṃ khippam eva nigacchati. 
+dasann' aññataraṃ ṭhānaṃ khippam eva nigacchati: 
 
 138 vedanaṃ pharusaṃ jāniṃ sarīrassa ca bhedanaṃ 
 garukaṃ vâpi ābādhaṃ cittakkhepaṃ va pāpuṇe. 
@@ -565,7 +565,7 @@ kāpotakāni aṭṭhīni tāni disvāna kā rati.
 yattha jarā ca maccu ca māno makkho ca ohito. 
 
 151 jīranti ve rājarathā sucittā 
-atho sarīram pi jaraṃ upeti 
+atho sarīram pi jaraṃ upeti. 
 satañ ca dhammo na jaraṃ upeti 
 santo have sabbhi pavedayanti. 
 
@@ -575,7 +575,6 @@ maṃsāni tassa vaḍḍhanti paññā tassa na vaḍḍhati.
 153 anekajātisaṃsāraṃ sandhāvissaṃ anibbisaṃ 
 gahakārakaṃ gavesanto, dukkhā jāti punappunaṃ. 
 
-
 154 gahakāraka diṭṭḥo si puna gehaṃ na kāhasi, 
 sabbā te phāsukā bhaggā gahakūṭaṃ visaṃkhitaṃ, 
 visaṃkhāragataṃ cittaṃ taṇhānaṃ khayam ajjhagā. 
@@ -599,7 +598,7 @@ ath' aññam anusāseyya, na kilisseyya paṇḍito.
 159 attānañ ce tathā kayrā yath' aññam anusāsati 
 sudanto vata dametha, attā hi kira duddamo. 
 
-160 attā hi attano nātho, ko hi nātho paro siyā; 
+160 attā hi attano nātho, ko hi nātho paro siyā,
 attanā hi sudantena nāthaṃ labhati dullabhaṃ. 
 
 161 attanā va kataṃ pāpaṃ 
@@ -672,7 +671,7 @@ Lokavaggo terasamo
 
 14. Buddhavagga
 179 yassa jitaṃ nâvajīyati 
-jitaṃ assa no yāti koci loke, 
+jitaṃ assa no yāti koci loke 
 tam buddham anantagocaraṃ 
 apadaṃ kena padena nessatha. 
 
@@ -702,7 +701,7 @@ mattaññutā ca bhattasmiṃ pantañ ca sayanāsanaṃ
 adhicitte ca āyogo etaṃ Buddhāna sāsanaṃ. 
 
 186 na kahāpaṇavassena titti kāmesu vijjati, 
-"appassādā dukhā kāmā" iti viññāya paṇḍito. 
+"appassādā dukhā kāmā" iti viññāya paṇḍito 
 
 187 Api dibbesu kāmesu ratiṃ so nâdhigacchati, 
 taṇhakkhayarato hoti sammāsambuddhasāvako. 
@@ -723,7 +722,7 @@ sammapaññāya passati
 191 dukkhaṃ dukkhasamuppādaṃ 
 dukkhassa ca atikkamaṃ 
 ariyañ c' aṭṭhaṅgikaṃ maggaṃ 
-dukkhūpasamagāminaṃ. 
+dukkhūpasamagāminaṃ 
 
 192 etaṃ kho saraṇaṃ khemaṃ etaṃ saraṇam uttamaṃ 
 etaṃ saraṇaṃ āgamma sabbadukkhā pamuccati. 
@@ -735,7 +734,7 @@ yattha so jāyatī dhīro taṃ kulaṃ sukham edhati.
 sukhā saṃghassa sāmaggī samaggānaṃ tapo sukho. 
 
 195 pūjārahe pūjayato Buddhe yadiva sāvake 
-papañcasamatikkante tiṇṇasokapariddave, 
+papañcasamatikkante tiṇṇasokapariddave 
 
 196 te tādise pūjayato nibbute akutobhaye 
 na sakkā puññaṃ saṃkhātuṃ im' ettam api kenaci.
@@ -864,7 +863,7 @@ mitabhāṇinam pi nindanti, n' atthi loke anindito.
 ekantaṃ nindito poso ekantaṃ vā pasaṃsito. 
 
 229 yañ ce viññū pasaṃsanti anuvicca suve suve 
-acchiddavuttiṃ medhāviṃ paññāsīlasamāhitaṃ. 
+acchiddavuttiṃ medhāviṃ paññāsīlasamāhitaṃ 
 
 230 nekkhaṃ jambonadassêva ko taṃ ninditum arhati, 
 devâpi naṃ pasaṃsanti, Brahmunâpi pasaṃsito. 
@@ -1119,7 +1118,7 @@ yesaṃ divā ca ratto ca niccaṃ Saṃgha-gatā sati.
 299 suppabuddhaṃ pabujjhanti sadā Gotamasāvakā 
 yesaṃ divā ca ratto ca niccaṃ kāyagatā sati. 
 
-300 suppabuddhaṃ pabujjhanti sadā Gotamasāvakā 
+300 suppabuddhaṃ pabujjhanti sadā Gotamasāvakā. 
 yesaṃ divā ca ratto ca ahiṃsāya rato mano. 
 
 301 suppabuddhaṃ pabujjhanti sadā Gotamasāvakā 
@@ -1277,10 +1276,10 @@ mā vo naḷaṃ va soto va Māro bhañji punappunaṃ.
 338 yathāpi mūle anupaddave daḷhe 
 chinno pi rukkho punar eva rūhati 
 evam pi taṇhānusaye anūhate 
-nibbattati dukkham idaṃ punappunaṃ. 
+nibbattati dukkham idaṃ punappunaṃ 
 
-339 yassa chattiṃsatī sotā manāpassavanā bhusā, 
-vāhā vahanti duddiṭṭhaṃ saṃkappā rāganissitā. 
+339 yassa chattiṃsatī sotā manāpassavanā bhusā 
+vāhā vahanti duddiṭṭhaṃ saṃkappā rāganissitā 
 
 340 savanti sabbadā sotā latā ubbhijja tiṭṭhati 
 tañ ca disvā lataṃ jātaṃ mūlaṃ paññāya chindatha. 
@@ -1374,7 +1373,7 @@ Taṇhāvaggo catuvīsatimo
 ghāṇena saṃvaro sādhu, sādhu jivhāya saṃvaro. 
 
 361 kāyena saṃvaro sādhu, sādhu vācāya saṃvaro, 
-manasā saṃvaro sādhu, sādhu sabbattha saṃvaro 
+manasā saṃvaro sādhu, sādhu sabbattha saṃvaro, 
 sabbattha saṃvuto bhikkhu sabbadukkhā pamuccati. 
 
 362 hatthasaññato pādasaññato 
@@ -1464,16 +1463,16 @@ vītaddaraṃ visaññuttaṃ tam ahaṃ brūmi brāhmaṇaṃ.
 386 jhāyiṃ virajam āsīnaṃ katakiccaṃ anāsavaṃ 
 uttamatthaṃ anuppattaṃ tam ahaṃ brūmi brāhmaṇaṃ. 
 
-387 divā tapati ādicco, rattiṃ ābhāti candimā, 
-sannaddho khattyo tapati, jhāyī tapati brāhmaṇo, 
+387 divā tapati ādicco, rattiṃ ābhāti candimā. 
+sannaddho khattyo tapati, jhāyī tapati brāhmaṇo. 
 atha sabbam ahorattiṃ Buddho tapati tejasā. 
 
 388 bāhitapāpo ti brāhmaṇo 
 samacaryā samaṇo ti vuccati. 
 pabbājayam attano malaṃ 
 tasmā pabbajito ti vuccati. 
 
-389 na brāhmaṇassa hareyya 
+389 na brāhmaṇassa hareyya; 
 nâssa muñcetha brāhmaṇo, 
 dhī brāhmaṇassa hantāraṃ, 
 tato dhī y' assa muñcati. 
@@ -1496,7 +1495,7 @@ na jaccā hoti brāhmaṇo,
 yamhi saccañ ca dhammo ca 
 so sukhī so ca brāhmaṇo. 
 
-394 kin te jaṭāhi dummedha, kin te ajinasāṭiyā, 
+394 kin te jaṭāhi dummedha, kin te ajinasāṭiyā. 
 abbhantaran te gahanaṃ, bāhiraṃ parimajjasi. 
 
 395 paṃsukūladharaṃ jantuṃ