Semantic Scholar Open Research Corpus

Synopsis

This release of the full Semantic Scholar corpus offers data relating to papers that were crawled from the web and passed through a number of filters. It covers over 45 million published research papers in Computer Science, Neuroscience, and Biomedical fields, provided as JSON objects, one per line. Papers are grouped in batches and shared as a collection of gzipped files; each file is about 990 MB, and the total collection is about 46 GB. A sample file of about 100 records (98 KB) is also provided, as is a copy of the license agreement. The manifest lists the available files.
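Because each file is gzipped and holds one JSON object per line, it can be streamed without decompressing it to disk first. The following is a minimal sketch of that pattern; the helper name `iter_records` is an assumption made for illustration and is not part of this repository.

```python
import gzip
import json
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzipped file.

    Hypothetical helper: reads the file as a text stream so that a
    ~990 MB shard never has to be held in memory all at once.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines rather than fail on them
                yield json.loads(line)
```

For a first look at the data, iterating over `iter_records("sample-S2-records.gz")` and printing `sorted(record.keys())` for the first record shows which fields are present.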

Elasticsearch

The folder ./elasticsearch contains the indexed data files for a local single-node Elasticsearch cluster (Docker image).
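Once the local node is running, the index can be queried over Elasticsearch's HTTP search endpoint. The sketch below only builds such a request. It assumes the node listens on the default port 9200, and the index and field names are placeholders, since this README does not state them.

```python
import json
import urllib.request


def build_search_request(index: str, field: str, text: str,
                         host: str = "http://localhost:9200") -> urllib.request.Request:
    """Build a POST request for a simple `match` query.

    `index`, `field` and `host` are assumptions: substitute the values
    used by the local cluster.
    """
    body = json.dumps({"query": {"match": {field: text}}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/{index}/_search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` and decoding the response with `json.load` should give the usual Elasticsearch result object, with matches under `hits`.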

Files and Folders

corpus-2019-01-31/s2-corpus-00.gz
corpus-2019-01-31/s2-corpus-01.gz
corpus-2019-01-31/s2-corpus-02.gz
corpus-2019-01-31/s2-corpus-03.gz
corpus-2019-01-31/s2-corpus-04.gz
corpus-2019-01-31/s2-corpus-05.gz
corpus-2019-01-31/s2-corpus-06.gz
corpus-2019-01-31/s2-corpus-07.gz
corpus-2019-01-31/s2-corpus-08.gz
corpus-2019-01-31/s2-corpus-09.gz
corpus-2019-01-31/s2-corpus-10.gz
corpus-2019-01-31/s2-corpus-11.gz
corpus-2019-01-31/s2-corpus-12.gz
corpus-2019-01-31/s2-corpus-13.gz
corpus-2019-01-31/s2-corpus-14.gz
corpus-2019-01-31/s2-corpus-15.gz
corpus-2019-01-31/s2-corpus-16.gz
corpus-2019-01-31/s2-corpus-17.gz
corpus-2019-01-31/s2-corpus-18.gz
corpus-2019-01-31/s2-corpus-19.gz
corpus-2019-01-31/s2-corpus-20.gz
corpus-2019-01-31/s2-corpus-21.gz
corpus-2019-01-31/s2-corpus-22.gz
corpus-2019-01-31/s2-corpus-23.gz
corpus-2019-01-31/s2-corpus-24.gz
corpus-2019-01-31/s2-corpus-25.gz
corpus-2019-01-31/s2-corpus-26.gz
corpus-2019-01-31/s2-corpus-27.gz
corpus-2019-01-31/s2-corpus-28.gz
corpus-2019-01-31/s2-corpus-29.gz
corpus-2019-01-31/s2-corpus-30.gz
corpus-2019-01-31/s2-corpus-31.gz
corpus-2019-01-31/s2-corpus-32.gz
corpus-2019-01-31/s2-corpus-33.gz
corpus-2019-01-31/s2-corpus-34.gz
corpus-2019-01-31/s2-corpus-35.gz
corpus-2019-01-31/s2-corpus-36.gz
corpus-2019-01-31/s2-corpus-37.gz
corpus-2019-01-31/s2-corpus-38.gz
corpus-2019-01-31/s2-corpus-39.gz
corpus-2019-01-31/s2-corpus-40.gz
corpus-2019-01-31/s2-corpus-41.gz
corpus-2019-01-31/s2-corpus-42.gz
corpus-2019-01-31/s2-corpus-43.gz
corpus-2019-01-31/s2-corpus-44.gz
corpus-2019-01-31/s2-corpus-45.gz
corpus-2019-01-31/s2-corpus-46.gz
sample-S2-records.gz
license.txt
manifest.txt
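The corpus shards follow a regular naming pattern, so the list above can be generated rather than typed out, for example to drive a download or processing loop. This is a minimal sketch; the function name and its defaults are illustrative and simply mirror the 47 file names listed here.

```python
from typing import List


def corpus_shard_paths(release: str = "corpus-2019-01-31",
                       count: int = 47) -> List[str]:
    """Return shard paths s2-corpus-00.gz .. s2-corpus-46.gz for a release.

    The defaults reproduce the file list in this README; other releases
    may use a different shard count.
    """
    return [f"{release}/s2-corpus-{i:02d}.gz" for i in range(count)]
```

Comparing the generated paths with the entries in manifest.txt is a quick way to check that no shard is missing locally.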

Research and Use Cases

License Information

Data Source

https://api.semanticscholar.org/corpus/
Waleed Ammar et al. 2018. Construction of the Literature Graph in Semantic Scholar. NAACL.

Publications