title	author	date
read me: convert_sakhrit-to-tei	Till Grallert	2018-11-05

The XSLT stylesheets in this repository provide a means to transform bibliographic information from archive.sakhrit.co into TEI XML.

scrape one specific journal

The process works in two steps:

1. The stylesheet sakhrit2tei_metadata.xsl scrapes one HTML file for each issue of the journal from Sakhrit.
2. The stylesheet sakhrit2tei_main.xsl transforms these HTML files into one TEI XML file per issue.

scrape everything

Sakhrit's website provides different types of detail pages that all provide slightly different sets of bibliographic data.

contents.aspx?CID= + number: provides information on the issue level, including:
- author (if known), article title, number of first page
ArticlePages.aspx?ArticleID= + number: provides information on the article level, including:
- journal title, issue, date, place, author, facsimiles, issue URL, author URL
- missing information: article title
authorsArticles.aspx?AID= + number: aggregates information on the author level across periodicals
- Problems:
  1. The content of the page is limited to 30 entries. JavaScript provides links to further pages, but these were not scraped with wget.
  2. Many articles have no byline, so there is no author page for them.
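The three page types above differ only in the page name and the query parameter. As a rough illustration (in Python, not part of this repository), the URLs to request could be assembled as follows. The scheme and host in `BASE_URL` and the helper name `detail_url` are assumptions made for this sketch.

```python
from urllib.parse import urlencode

# Assumption: pages are served directly from this host over HTTP.
BASE_URL = "http://archive.sakhrit.co"

# Page name and query parameter for each type of detail page listed above.
PAGE_TYPES = {
    "issue": ("contents.aspx", "CID"),
    "article": ("ArticlePages.aspx", "ArticleID"),
    "author": ("authorsArticles.aspx", "AID"),
}


def detail_url(page_type: str, number: int) -> str:
    """Build the URL of one detail page (hypothetical helper)."""
    page, param = PAGE_TYPES[page_type]
    return f"{BASE_URL}/{page}?{urlencode({param: number})}"
```

A list of such URLs, one per line, could then be handed to wget for downloading.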

1. scrape content with wget

2. extract bibliographic data with R

The R script r/convert_html-table.R reads all files following a certain naming pattern from a local directory, extracts bibliographic information, normalises dates, and saves everything as .csv and .rda. All functions for data conversion are kept in a separate script (functions_sakhrit.R).
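The repository does this step in R. The following Python sketch only illustrates the general idea of reading all files that match a pattern and collecting table rows into one CSV file. It assumes that the bibliographic data sits in `<td>` cells of HTML table rows; the names `TableRows`, `extract_rows`, and `convert_directory` are hypothetical and do not mirror the functions in functions_sakhrit.R.

```python
import csv
from html.parser import HTMLParser
from pathlib import Path


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows without any <td> cell, e.g. header rows
                self.rows.append(self._row)
            self._row = None


def extract_rows(html: str) -> list[list[str]]:
    """Return all table rows found in one HTML document."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


def convert_directory(directory: str, pattern: str, out_csv: str) -> int:
    """Parse every file matching `pattern` and write all rows to one CSV file."""
    rows = []
    for path in sorted(Path(directory).glob(pattern)):
        rows.extend(extract_rows(path.read_text(encoding="utf-8")))
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return len(rows)
```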

ingest data: `contents.aspx?CID=` + number

The script extracts bibliographic information using the following naming scheme:

journal.title
journal.id
issue
issue.id
date.publication
date.publication.iso
author.name
article.title
article.url
article.id
page.from
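To make the naming scheme concrete, here is a small Python sketch (not the repository's R code) that uses the column names above as a CSV header. Writing missing values, such as the author of an article without a byline, as empty strings is an assumption of this sketch, and `to_csv` is a hypothetical helper.

```python
import csv
import io

# Column names taken from the naming scheme above.
COLUMNS = [
    "journal.title",
    "journal.id",
    "issue",
    "issue.id",
    "date.publication",
    "date.publication.iso",
    "author.name",
    "article.title",
    "article.url",
    "article.id",
    "page.from",
]


def to_csv(records: list[dict]) -> str:
    """Serialise records as CSV; fields missing from a record stay empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

A record with a key outside the naming scheme raises a `ValueError`, which helps to catch misspelled column names early.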

save data

The resulting data frames or tibbles can be converted to XML. Each bibliographic entry is then transformed into a TEI P5 <biblStruct> using xslt/convert_r-output-to-tei.xsl.
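The actual conversion is done by the XSLT stylesheet. To give an impression of the target structure, the following Python sketch builds a <biblStruct> from one record that uses the naming scheme of the previous step. The choice of child elements and attributes is an assumption of this sketch and may differ from what xslt/convert_r-output-to-tei.xsl produces; `bibl_struct` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialise TEI as the default namespace


def _sub(parent, tag, text=None, **attrib):
    """Append a TEI child element with optional text and attributes."""
    element = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}", attrib)
    element.text = text
    return element


def bibl_struct(record: dict) -> ET.Element:
    """Build a <biblStruct> for one article record (assumed layout)."""
    root = ET.Element(f"{{{TEI_NS}}}biblStruct")
    analytic = _sub(root, "analytic")
    _sub(analytic, "title", record["article.title"], level="a")
    if record.get("author.name"):  # many articles have no byline
        author = _sub(analytic, "author")
        _sub(author, "persName", record["author.name"])
    monogr = _sub(root, "monogr")
    _sub(monogr, "title", record["journal.title"], level="j")
    imprint = _sub(monogr, "imprint")
    _sub(imprint, "date", when=str(record["date.publication.iso"]))
    _sub(monogr, "biblScope", str(record["issue"]), unit="issue")
    _sub(monogr, "biblScope", unit="page", **{"from": str(record["page.from"])})
    return root
```

Calling `ET.tostring(bibl_struct(record), encoding="unicode")` returns the serialised element.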

3. prepare data with R

Slice data: to better understand changes over time, the function f.date.slice.period slices all data into rolling periods of any number of years.
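The exact behaviour of f.date.slice.period is defined in the R scripts. As a rough Python illustration of slicing into rolling periods, the sketch below assumes that windows of a fixed number of years advance by one year at a time and that each record carries a full ISO date (YYYY-MM-DD) in `date.publication.iso`. The function name `slice_periods` is hypothetical.

```python
from datetime import date


def slice_periods(records, years, date_key="date.publication.iso"):
    """Group records into rolling windows of `years` years.

    Returns a dict that maps (first_year, last_year) to the records
    published within that window. Windows start one year apart.
    """
    dated = [(date.fromisoformat(r[date_key]).year, r) for r in records]
    if not dated:
        return {}
    first = min(year for year, _ in dated)
    last = max(year for year, _ in dated)
    periods = {}
    # Always emit at least one window, even if the data span fewer years.
    for start in range(first, max(first, last - years + 1) + 1):
        end = start + years - 1
        periods[(start, end)] = [r for year, r in dated if start <= year <= end]
    return periods
```

Because the windows overlap, a record can appear in more than one period.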

OpenArabicPE/convert_sakhrit-to-tei