Commit
data version 0.3-1.0.4
dirkroorda committed Mar 20, 2019
1 parent f20baa3 commit 43c36d1
Showing 75 changed files with 2,920,980 additions and 316 deletions.
3 changes: 2 additions & 1 deletion analysis/ummama.ipynb
@@ -516,7 +516,8 @@
"introNouns = collections.Counter()\n",
"\n",
"for (line, um, ma1, word, ma2) in results:\n",
" introNouns[F.symr.v(word)] += 1\n",
" strippedWord = L.d(word, otype='sign')[:-1]\n",
" introNouns[F.symr.v(strippedWord)] += 1\n",
"\n",
"len(introNouns)"
]
11 changes: 11 additions & 0 deletions characters/mapping.tsv
@@ -8,6 +8,7 @@
1(esze3) 𒑘
1(gesz'u) 𒐞
1(gesz2) 𒐕
1(iku) 𒀸
1(szar2) 𒊹
1(u) 𒌋
1/2(disz) 𒈦
@@ -23,6 +24,7 @@
2(gesz'u) 𒐟
2(gesz2) 𒐖
2(gisz) 𒄑
2(iku) 𒐀
2(szar2) 𒐣
2(u) 𒌋
2/3(disz) 𒑛
@@ -32,8 +34,10 @@
3(bur'u) 𒐶
3(bur3) 𒌋𒌋𒌋
3(disz) 𒐈
3(esze3) 𒀸𒌋
3(gesz'u) 𒐠
3(gesz2) 𒐗
3(iku) 𒐁
3(u) 𒌋
4(asz) 𒐂
4(ban2) 𒑒
@@ -43,6 +47,7 @@
4(disz) 𒐉
4(gesz'u) 𒐡
4(gesz2) 𒐘
4(iku) 𒐂
4(u) 𒐏
5(asz) 𒐃
5(ban2) 𒑔
@@ -51,24 +56,29 @@
5(bur3) 𒐐
5(disz) 𒐊
5(gesz2) 𒐙
5(iku) 𒐃
5(u) 𒐐
5/6(disz) 𒑜
6(asz) 𒐄
6(bur3) 𒐑
6(disz) 𒐋
6(gesz2) 𒐚
6(iku) 𒐄
7(asz) 𒐅
7(bur3) 𒐒
7(disz) 𒐌
7(gesz2) 𒐛
7(iku) 𒐅
8(asz) 𒐆
8(bur3) 𒐓
8(disz) 𒐍
8(gesz2) 𒐜
8(iku) 𒐆
9(asz) 𒐇
9(bur3) 𒐔
9(disz) 𒐎
9(gesz2) 𒐝
9(iku) 𒐇
A 𒀀
AB 𒀊
AD 𒀜
@@ -366,6 +376,7 @@ el3 𒀭
elam 𒉏
em 𒅎
eme 𒅴
eme6 𒀲𒊩
en 𒂗
en6 𒅔
engar 𒀳
14 changes: 10 additions & 4 deletions docs/transcription.md
@@ -98,7 +98,7 @@ There are several types of sign, stored in the feature `type`.
type | examples | description
------- | ------ | ------
`reading` | `ma` `qa2` | normal sign with a reading (lowercase)
`unknown` | `x` | representation of an unknown sign
`unknown` | `x` `n` | representation of an unknown sign, the `n` stands for an unknown numeral
`numeral` | `5(disz)` `5/6(disz)` | a numeral, either with a repeat or with a fraction
`ellipsis` | `...` | representation of an unknown number of missing signs
`grapheme` | `ARAD2` `GAN2` | sign given as a grapheme (uppercase)
@@ -291,11 +291,16 @@ Simple signs may be *augmented* with *flags* (see below).

### Unknown signs ###

The letters `x` and `X` in isolation stand for an unknown signs.
The letters `x` and `X`, `n` and `N` in isolation stand for unknown signs.

The *type* of such signs is `unknown`.

If the value is `x`, it will stored in **reading**, if it is `X` in **grapheme**.
If the value is `x` or `n`, it will be stored in **reading**; if it is `X` or `N`, in **grapheme**.

The `x` and `X` stand for completely unknown signs, the `n` and `N` stand for unknown signs
of which it is known that they are numerals.

**N.B:** See under numerals below, where `n` plays a slightly different role.

### Ellipsis ###

@@ -318,7 +323,8 @@ Numeric signs may also be preceded with a *fraction*:
We store the integral number before the brackets in the feature **repeat**,
and the fraction in the feature **fraction**.

If the repeat is `n`, it means that a number is missing.
If the repeat is `n`, it means that the amount of repetition is uncertain
or that a repetition is missing.
We store it as `repeat` = `-1`, so repeats always have an integer value.

In a numeral, within the brackets you find the **reading** or **grapheme**,
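
The conventions documented in the `docs/transcription.md` hunks above (unknown signs `x`, `X`, `n`, `N`; numerals with a repeat or a fraction; `repeat` = `-1` for `n`) can be illustrated with a small sketch. This is not the logic of `tfFromATF.py`: the function `classify`, the regular expression, the choice to keep the fraction as a string, and the rule "lowercase goes to **reading**, anything else to **grapheme**" for the part within the brackets are assumptions made for illustration only.

```python
import re

# Letters that denote an unknown sign when they occur in isolation.
UNKNOWN = set('xXnN')

# Hypothetical pattern: a repeat (digits or n) or a fraction, then (...) content.
NUMERAL_RE = re.compile(r'^(n|\d+(?:/\d+)?)\((.+)\)$')


def classify(token):
    """Return illustrative feature values for one sign token, or None."""
    if token in UNKNOWN:
        # Lowercase x/n go to reading, uppercase X/N to grapheme.
        slot = 'reading' if token.islower() else 'grapheme'
        return {'type': 'unknown', slot: token}

    match = NUMERAL_RE.match(token)
    if match is None:
        return None  # other sign types are out of scope for this sketch

    amount, inner = match.groups()
    features = {'type': 'numeral'}
    if amount == 'n':
        features['repeat'] = -1  # uncertain or missing repetition
    elif '/' in amount:
        features['fraction'] = amount  # assumption: kept as a string
    else:
        features['repeat'] = int(amount)

    # Assumption: lowercase content is a reading, otherwise a grapheme.
    slot = 'reading' if inner.islower() else 'grapheme'
    features[slot] = inner
    return features
```

Under these assumptions, `classify('n')` yields an `unknown` sign stored in **reading**, while `classify('n(disz)')` yields a `numeral` with `repeat` equal to `-1`, which is the "slightly different role" of `n` that the documentation mentions.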
463 changes: 237 additions & 226 deletions programs/checks.ipynb

Large diffs are not rendered by default.

128 changes: 66 additions & 62 deletions programs/mapReadings.ipynb

Large diffs are not rendered by default.

32 changes: 9 additions & 23 deletions programs/tfFromATF.py
@@ -13,8 +13,8 @@
BASE = os.path.expanduser('~/github')
ORG = 'Nino-cunei'
REPO = 'oldbabylonian'
VERSION_SRC = '0.2'
VERSION_TF = '1.0.3'
VERSION_SRC = '0.3'
VERSION_TF = '1.0.4'
REPO_DIR = f'{BASE}/{ORG}/{REPO}'

TRANS_DIR = f'{REPO_DIR}/sources/cdli/transcriptions'
@@ -28,17 +28,9 @@
TF_DIR = f'{REPO_DIR}/tf'
OUT_DIR = f'{TF_DIR}/{VERSION_TF}'

# SOURCE FIXES

SRC_FIXES = dict(
bub='dub',
umi='um i',
ura='ra',
szii='szi',
)
# CHARACTERS

UNMAPPABLE = {'x', 'X', '...'}
UNMAPPABLE = {'x', 'X', 'n', 'N', '...'}

prime = "'"
ellips = '…'
@@ -51,7 +43,7 @@
't,': 'ţ',
}

unknownStr = 'xX'
unknownStr = 'xXnN'
unknownSet = set(unknownStr)

lowerLetterStr = 'abcdefghijklmnopqrstuvwyz' + ''.join(emphatic.values())
@@ -657,12 +649,6 @@ def getMapping():


def getSources():
if os.path.exists(OUT_DIR):
rmtree(OUT_DIR)
os.makedirs(OUT_DIR, exist_ok=True)

# list all sources

return tuple(
os.path.splitext(os.path.basename(f))[0]
for f in glob(f'{IN_DIR}/*.txt')
@@ -693,6 +679,11 @@ def checkSane(line):


def convert():
if generateTf:
if os.path.exists(OUT_DIR):
rmtree(OUT_DIR)
os.makedirs(OUT_DIR, exist_ok=True)

cv = getConverter()

return cv.walk(
@@ -1708,11 +1699,6 @@ def getParts(word):
if isNumbered:
ln = isNumbered.group(1)
recentTrans = isNumbered.group(2)
for (pat, rep) in SRC_FIXES.items():
newRecentTrans = recentTrans.replace(pat, rep)
if newRecentTrans != recentTrans:
warnings[f'source fix: {pat} => {rep}'][src].add((i, line, pNum, None))
recentTrans = newRecentTrans

else:
errors[f'line: not numbered'][src].add((i, line, pNum, None))
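
In the `programs/tfFromATF.py` hunks above, the removal and re-creation of `OUT_DIR` moves out of `getSources()` and into `convert()`, behind the `generateTf` flag. The likely effect is that merely listing the sources no longer wipes earlier output. A minimal sketch of that guard follows, assuming `generateTf` is a boolean defined elsewhere in the script (it is not shown in this diff); the helper name `prepare_out_dir` and the file `stale.tf` are hypothetical.

```python
import os
import tempfile
from shutil import rmtree


def prepare_out_dir(out_dir, generate_tf):
    """Recreate out_dir only when output is going to be generated.

    Returns True if the directory was reset, False if it was left alone.
    """
    if not generate_tf:
        return False
    if os.path.exists(out_dir):
        rmtree(out_dir)
    os.makedirs(out_dir, exist_ok=True)
    return True


# Demo in a throwaway directory: one leftover file from an earlier run.
base = tempfile.mkdtemp()
out_dir = os.path.join(base, 'tf', '1.0.4')
os.makedirs(out_dir)
stale = os.path.join(out_dir, 'stale.tf')
open(stale, 'w').close()

# A dry run leaves the earlier output in place.
kept = not prepare_out_dir(out_dir, generate_tf=False) and os.path.exists(stale)

# A generating run starts from an empty directory.
wiped = prepare_out_dir(out_dir, generate_tf=True) and not os.path.exists(stale)

rmtree(base)  # clean up the demo directory
```

Keeping the destructive step next to the code that writes the output means a check-only run cannot delete a previously generated data version.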