Curated Datasets for the Slingshot Competition

Slingshot uses curated datasets to ensure that meaningful data is stored on and retrieved from the Filecoin network. Use cases do not need to be complex, and applications can be proprietary in nature.

A wide variety of public datasets can be used for this challenge; a sampling is shown in the table below.

If you would like to use a dataset that you don't see listed here, please submit a PR to add the dataset to this table.

| Name | Description | Size | Format | URL |
| --- | --- | --- | --- | --- |
| COVID-19 Open Research Dataset | An AI challenge with AI2, CZI, MSR, Georgetown, NIH & The White House | 19 GB | JSON | https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge |
| Chest X-Ray Images (Pneumonia) | 5,863 images, 2 categories | 2.29 GB | JPEG | https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia |
| Huge Stock Market Dataset | Historical daily prices and volumes of all U.S. stocks and ETFs | 772 MB | CSV | https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs |
| Condensed Movies | A large-scale video dataset, featuring clips from movies with detailed captions. | 250 GB | Video | https://www.robots.ox.ac.uk/~vgg/research/condensed-movies/ |
| USENET (2005-2011) | Compressed USENET posts | 36 GB | Text | http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html |
| Sloan Digital Sky Survey | Three-dimensional view of the universe | 273 TB | Various | https://www.sdss.org/ |
| GHTorrent Project | A scalable, queryable, offline mirror of data offered through the GitHub REST API. | 18 TB | MySQL | https://ghtorrent.org/ |
| Free Music Archive | 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres | 879 GB | MP3 | https://github.com/mdeff/fma |
| Open Images Dataset | 9 million URLs to images that have been annotated with labels spanning over 6000 categories | 18 TB | PNG | https://storage.googleapis.com/openimages/web/index.html |
| Internet Archive | A digital library of Internet sites and other cultural artifacts in digital form | 45 PB | Various | https://archive.org/ |
| Common Crawl | An open repository of web crawl data | 235 TB | WARC | https://commoncrawl.org/ |
| Noisy speech database | Used for training speech enhancement algorithms and TTS models | 14 GB | WAV | https://datashare.is.ed.ac.uk/handle/10283/2791 |
| NFL play-by-play | The data has three tables: teams, players, and plays. | 2.54 GB | Text | https://www.dolthub.com/repositories/Liquidata/nfl-play-by-play |
| NYC Trip Record | Data include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. | 267 GB | CSV | https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page |
| National Cancer Institute | Cancer data for analysis | 18.46 TB | JSON | https://portal.gdc.cancer.gov/repository |
| Public Blockchain Datasets | Blockchain data from cryptocurrencies Bitcoin, Ethereum, Dogecoin, ZCash, Litecoin, Dash, Bitcoin Cash, Ethereum Classic, Tezos, Hedera Hashgraph, IoTex. | 9 TB | Various | https://github.com/blockchain-etl/public-datasets |
| Landsat 8 | Multispectral time series satellite imagery of all land on Earth since 2013 | 1.3 PB (estimated) | GeoTIFF + metadata - sample scene | https://registry.opendata.aws/landsat-8/#usageexamples |
| Docker Images | Docker container images that are published on Docker Hub | 167 TB | Images | https://hub.docker.com/ |
| Filecoin Proofs | - | 224 GB | - | https://proofs.filecoin.io/ |
| Filecoin Trusted Setup | - | 2.05 TB | - | https://trusted-setup.filecoin.io/ |
| Audius | - | GB | MP3 | https://www.audius.com/ |
| Flickr Commons | The key goal of The Commons is to share hidden treasures from the world's public photography archives. | 50 TB | JPEG | https://www.flickr.com/commons |
| arXiv | Scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, and more. | - | PDF | https://arxiv.org/ |
| Audius | An American decentralized music platform developing the first community-owned and artist-controlled music-sharing protocol. | - | MP3 | https://audius.co/ |
| Blackbird Dataset | A large-scale dataset for UAV perception in aggressive flight | 4.79 TB | - | https://academictorrents.com/details/eb542a231dbeb2125e4ec88ddd18841a867c2656 |
| Linux ISO | Linux ISO Images | - | ISO | https://www.linuxlookup.com/linux_iso |
| ArchLinux | ArchLinux packages repository | 56 GB | Various | https://wiki.archlinux.org/index.php/Mirrors |
| CentOS | CentOS packages repository | 200 GB | Various | http://mirror.sesp.northwestern.edu/centos/ |
| Data is Plural | A variety of public, structured data sets. | - | Various | https://tinyletter.com/data-is-plural/archive |
| Tencent Corpus for Chinese Words and Phrases | Meant for AI purposes | 6.3 GB | Various | https://ai.tencent.com/ailab/nlp/en/embedding.html |
| R-fMRI Maps Project | Medical data from neurological imaging | - | Various | http://mrirc.psych.ac.cn/RfMRIMaps |
| National Palace Museum (Taiwan) | A variety of museum artifacts | - | Various | https://theme.npm.edu.tw/opendata/ |
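
The Size column mixes units (MB, GB, TB, PB), spacing (`18TB` vs `18 TB`), annotations such as `(estimated)`, and `-` for unknown values, so comparing datasets takes a little normalization. The following is a minimal sketch of one way to turn those strings into byte counts. The `parse_size` helper and the `UNITS` table are hypothetical names used only for illustration, and the decimal (SI) multipliers are an assumption, since the table does not say whether sizes are decimal or binary.

```python
import re
from decimal import Decimal
from typing import Optional

# Assumption: sizes use decimal (SI) multipliers, e.g. 1 GB = 10**9 bytes.
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15}

# A leading number, optional whitespace, then a unit; trailing notes are ignored.
SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(KB|MB|GB|TB|PB)\b", re.IGNORECASE)


def parse_size(text: str) -> Optional[int]:
    """Return the size in bytes, or None when no number and unit are found."""
    match = SIZE_RE.match(text)
    if match is None:
        return None
    value, unit = match.groups()
    # Decimal avoids floating-point rounding on values such as "2.29".
    return int(Decimal(value) * UNITS[unit.upper()])


if __name__ == "__main__":
    for size in ["19 GB", "18TB", "1.3 PB (estimated)", "-"]:
        print(f"{size!r}: {parse_size(size)}")
```

Entries with no recoverable number (`-`, or a bare unit such as `GB`) come back as `None` rather than zero, so they can be filtered out instead of silently skewing a total.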