-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
162 lines (116 loc) · 6.44 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
---
output: github_document
title: R toolbox for NACO and CJK name authority comparison
author: Alan Engel
date: "`r format(Sys.time(), '%d %B, %Y')`"
lang: "en-us"
output_format: github_document
bibliography: bibliography.bib
link-citations: true
---
The Library of Congress's [Authority File Comparison Rules (NACO Normalization)](https://www.loc.gov/aba/pcc/naco/normrule-2.html)
were finalized in 2009 for the specific purpose of determining whether or not names in the 4XX fields
can be added to the 1XX fields. If the comparision rules find that a name in a 4XX field already exists,
in normalized form, in the 1XX fields, it cannot be added.[@hickey2006naco][@loc2020descriptive]
Thanks to a splendid new R package, stringi, developed by Marek Gagolewski,[@R-stringi] the comparison rules
can be implemented using calls to the ICU4C library from the [International Components for Unicode](https://icu.unicode.org/).
A step-by-step walk through vignette shows how the comparison rules are executed using stringi functions.
This package is a sibling of a [R package viafr](https://github.com/kijinosu/viafr), which was forked
from Stefanie Schneider's viafr package on Github.[@R-viafr] In a related project, viafr development is
aimed at extracting information, for example, social networks based on coauthors, from the
[Virtual International Authority Files (VIAF)](https://viaf.org/). Normalization functionality
for that package will be built and maintained in rnaco.
# Features
## Functions
* get_transliteration_rules() Fetch transliteration rules for use in stringi::stri_trans_general()
* moji_transform() Normalizing transform based on MojiJoho variants
* multiform_transform() CJK normalizing transform with user-selectable form (JA, TW or ZH)
* naco_transform() This function follows the Authority File Comparison Rules using stringi[@R-stringi]
to implement the [International Components for Unicode](https://icu.unicode.org/). The derivation of
naco_transform() is detailed in a walk-through vignette, ```rnaco::naco```.
* show_cluster() Display ideographic clusters used in normalizing transform
* xml2df() This function reads a specially structured xml, and produces a tibble.
## Data
Of course, development based on NACO did not end in 2009 and is included in the
[NACO - Name Authority Cooperative Program](https://www.loc.gov/aba/pcc/naco/index.html). This package
is intended to facilitate following developments and includes test datasets from the
[CJK NACO Project](https://www.loc.gov/aba/pcc/naco/CJK.html). In particular, each build
of the rnaco project downloads several related files, which are then included as ```tibbles```[@R-tibble]
in package data.
### MARC code tables
Dataset | Description
------- | --------------
marcNotes | Names, XPaths and notes for MARC-8 code tables
marcBasicArabic | Basic Arabic code table, ISOcode=33 (July 2001)
marcBasicCyrillic | Basic Cyrillic code table, ISOcode=4E (January 2000)
marcBasicGreek | Basic Greek code table, ISOcode=53 (January 2000)
marcBasicHebrew | Basic Hebrew code table, ISOcode=32 (January 2000)
marcBasicLatinASCII | Basic Latin (ASCII) code table (January 2000) ISOcode=42
marcComponentInputMethodCharacters | Component Input Method Characters code table (February 6, 2003)
marcEastAsianIdeographsHan | East Asian Ideographs ('Han') code table (February 6, 2003)
marcEastAsianPunctuationMarks | East Asian Punctuation Marks code table (February 6, 2003)
marcExtendedArabic | Extended Arabic code table, ISO code 34 (July 2001)
marcExtendedCyrillic | Extended Cyrillic code table, ISO code 51 (January 2000)
marcExtendedLatinANSEL | Extended Latin (ANSEL) code table, ISO code 45 (January 2000, Updated September 2004)
marcGreekSymbols | Greek Symbols
marcJapaneseHiraganaandKatakana | Japanese Hiragana and Katakana code table (February 6, 2003)
marcKoreanHangul | Korean Hangul code table (February 6, 2003)
marcSubscripts | Subscripts
marcSuperscripts | Superscripts
### Datasets for naco_transform()
Dataset | Description
------- | --------------
numericrules | Transliteration rules for numeric decimals
step6rulestring | Transliteration rule for Step 6 of naco_transform()
step7lltranslitrule | Transliteration rule for Step 7 of naco_transform()
unconditionalMappings | Unconditional mappings for naco_transform()
### Datasets for CJK normalization
Dataset | Description
------- | --------------
unihan_variants | Unihan CJK variants
mojiMapdf | Mapping of Moji Joho
mojiVariants | Moji Joho variants
kyoikuKanji | Kanji by grade level in Japan
zhXiaoxueWenzi | Ideographs by grade level in PRC
twXiaoxueWenzi | Ideographs by grade level in TW
jctPrecedence | Precedence schemes for variant clusters
### Datasets for testing
This data comes from personal name authority records that were developed by the
[NACO CJK Funnel References Project](https://www.loc.gov/aba/pcc/naco/CJK.html) as a source of
Chinese, Japanese and Korean names for testing and development.
Dataset | Description
------- | --------------
cjkextractja | Personal name authority records - Japanese names
cjkextractko | Personal name authority records - Korean names
cjkextractzh | Personal name authority records - Chinese names
## Vignettes
* MARC Tables for NACO: For each build, this script downloads the [MARC code lists](https://www.loc.gov/marc/)
in XML format and extracts the code table listed above.
* rnaco::naco: Authority comparison (NACO) rules walk-through
* rnaco::nacoDatasets: Datasets for Authority File Comparison Rules (NACO Normalization)
* rnaco::unihanVariants: Unihan Variants Table for rnaco including analysis of their use in VAIF
normalization of CJK names
* rnaco::integrated-variant-map: An in-progress effort to combine NACO, Unihan variants and MojiJoho variants
into an internally coherent scheme for normalization
# Installation
You can install the latest version of __rnaco__ from
[github](https://github.com/kijinosu/rnaco) with:
```{r, install-instructions, eval=FALSE}
library(devtools)
devtools::install_github("kijinosu/rnaco")
```
# Other simlar work
* [pynaco](https://github.com/unt-libraries/pynaco) is a fork by University of North Texas Libraries
of Python sample code that was on the OCLC Research Website.
# R packages used for this package
* DataPackageR - Set up and put the pieces together [@R-DataPackageR]
* dplyr - [@R-dplyr]
* igraph - [@R-igraph]
* readxl - [@R-readxl]
* stringi - [@R-stringi]
* tibble - [@R-tibble]
* tidyr - [@R-tidyr]
* viafr(fork) - [@R-viafr]
* xml2 - [@R-xml2]
* xslt = [@R-xslt]
# References