
Downloading Resource Files

ernestogimeno edited this page Sep 17, 2020 · 10 revisions


To install the Python library dependencies:

pip install -r requirements.txt
python3 -m spacy download en_core_web_sm

Download Resource Files from S3

Resource files for the corpus have already been downloaded and processed. This section describes how to access them.

First, configure your .aws directory with valid keys for S3 access; the following script will not work without them. Although the bucket is publicly readable, the boto3 library still requires keys for a valid AWS user account.
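
Before running anything against S3, it can help to confirm that keys are actually in place. The sketch below is an illustration only, not part of this repo: it assumes the keys live in the standard shared credentials file (~/.aws/credentials) under a profile named default, and the helper name has_aws_keys is hypothetical. boto3 can also pick up credentials from other sources, such as environment variables, which this check does not cover.

```python
import configparser
from pathlib import Path


def has_aws_keys(credentials_path, profile="default"):
    """Return True if the given profile in an AWS credentials file defines both keys.

    Hypothetical helper for illustration; it only inspects the shared
    credentials file and does not verify that the keys are valid.
    """
    parser = configparser.ConfigParser()
    # read() silently skips files that do not exist, leaving the parser empty
    parser.read(credentials_path)
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return "aws_access_key_id" in section and "aws_secret_access_key" in section


if __name__ == "__main__":
    path = Path.home() / ".aws" / "credentials"
    print(f"{path}: keys found = {has_aws_keys(path)}")
```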

Then use the bin/ script as example code for downloading the PDF files (open access publications) and TXT files (raw extracted text) from the public S3 bucket. You will need to adapt that code to your own setup.
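
For orientation, a download loop with boto3 might look roughly like the sketch below. It is not the repo's script. It assumes the bucket layout described on this page (corpus_docs/pub/pdf and corpus_docs/pub/txt inside the richcontext bucket) and assumes that files are mirrored locally under resources/pub; the names local_path_for and download_kind are hypothetical.

```python
from pathlib import Path

BUCKET = "richcontext"               # bucket name given on this page
PREFIX = "corpus_docs/pub/"          # assumed from the bucket layout on this page
LOCAL_ROOT = Path("resources/pub")   # assumed local mirror of the bucket layout


def local_path_for(key, prefix=PREFIX, root=LOCAL_ROOT):
    """Map an S3 key such as corpus_docs/pub/pdf/x.pdf to resources/pub/pdf/x.pdf."""
    if not key.startswith(prefix):
        raise ValueError(f"key {key!r} is outside prefix {prefix!r}")
    return root / key[len(prefix):]


def download_kind(kind):
    """Download every object under corpus_docs/pub/<kind>/ (kind: 'pdf' or 'txt')."""
    import boto3  # third-party; imported lazily so the helper above works without it

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=f"{PREFIX}{kind}/"):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip folder placeholder objects
            target = local_path_for(key)
            target.parent.mkdir(parents=True, exist_ok=True)
            s3.download_file(BUCKET, key, str(target))
```

Calling download_kind("pdf") and then download_kind("txt") would fetch both sets of files. Pagination is used because a single list_objects_v2 call returns at most 1000 keys.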

Collecting Open Access PDFs

Download the corpus PDFs and other resource files:

python bin/ --logger errors.txt

The PDF files get stored in the resources/pub/pdf subdirectory.

Extracting text from PDFs

We use Parsr to extract text and JSON from research publications. The quickest way to install and run the Parsr API is through the docker image:

docker pull axarev/parsr

To run the API, issue:

docker run -p 3001:3001 axarev/parsr

-- The advanced guide is available here. --

Then run the script to extract text and JSON from the PDF files:

python bin/ localhost:3001
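
Because extraction depends on the Parsr container being up, a quick reachability check on the host:port argument can save a failed run. The sketch below is only an illustration with hypothetical helper names. It tests whether a TCP connection can be opened and says nothing about whether the Parsr API itself is healthy.

```python
import socket


def parse_host_port(address, default_port=3001):
    """Split 'host:port' into (host, port); use default_port if no port is given."""
    host, sep, port = address.rpartition(":")
    if not sep:
        return address, default_port
    return host, int(port)


def is_reachable(address, timeout=2.0):
    """Return True if a TCP connection to the address can be opened."""
    host, port = parse_host_port(address)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, is_reachable("localhost:3001") should return True once the docker run command above has started the container and the port is published.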

The outputs are saved in the json and text folders. Extraction can take a long time, so please be patient.

Also, there is a known error rate associated with Parsr. As a contingency, use the following script to extract text from the PDF files that Parsr does not handle:

python bin/
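
To see which PDFs still need the fallback, one option is to compare the PDF folder against the text output folder. The sketch below assumes that each extracted text file shares its PDF's base name with a .txt extension. That naming convention and the helper name pdfs_missing_text are assumptions, not something this page confirms.

```python
from pathlib import Path


def pdfs_missing_text(pdf_dir, txt_dir):
    """Return sorted PDF paths that have no same-named .txt file in txt_dir."""
    done = {p.stem for p in Path(txt_dir).glob("*.txt")}
    return sorted(p for p in Path(pdf_dir).glob("*.pdf") if p.stem not in done)
```

The directory arguments are left to the caller because the exact output folder names depend on how the extraction script was run.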

Upload PDF/JSON/TXT files

For those on the NYU-CI team who update the corpus:

Upload to the public S3 bucket:

  • PDF files (open access publications)
  • JSON files (semi-structured extracted text)
  • TXT files (raw text)

python bin/

S3 Bucket Specs

View the public AWS S3 Bucket richcontext online:

The directory structure of the public S3 bucket is similar to the directory structure used for resources in this repo:

- richcontext
  - corpus_docs
    - pub
      - pdf
      - json
      - txt
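
Given this layout, the bucket key for a file can be derived from its extension. The sketch below illustrates that mapping and a matching upload call. The helper names are hypothetical, the extension-to-folder mapping is inferred from the tree above, and uploading requires an account with write access to the bucket.

```python
from pathlib import Path

BUCKET = "richcontext"  # bucket name given on this page
# Folder under corpus_docs/pub/ for each file extension (inferred from the tree above)
KINDS = {".pdf": "pdf", ".json": "json", ".txt": "txt"}


def s3_key_for(filename):
    """Build the bucket key for a local file based on its extension."""
    name = Path(filename).name
    suffix = Path(name).suffix.lower()
    if suffix not in KINDS:
        raise ValueError(f"unsupported file type: {name!r}")
    return f"corpus_docs/pub/{KINDS[suffix]}/{name}"


def upload(filename):
    """Upload one local file to its place in the bucket."""
    import boto3  # third-party; imported lazily so s3_key_for works without it

    boto3.client("s3").upload_file(str(filename), BUCKET, s3_key_for(filename))
```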


Known issues when running this script on the v0.1.5 corpus file:

  • unable to download publication PDF files embedded in an HTML page
    (for example: on epdf links in,

kudos: @philipskokoh, @srand525, @JasonZhangzy1757
