Pulling barcodes from FigShare.ipynb
is a notebook that shows how to download the image sets from FigShare and extract the barcodes, which are then saved in barcodes_from_figshare.tsv.
Checking Botany scan dates.ipynb
is a notebook containing a processing function for Dask that pulls out every image id from records that have more than one image.
media_list.txt
contains a list of AWS media ids that are used by the download_images.py script. The file was created with the command
aws s3 ls s3://smithsonian-open-access/media/nmnh/ > media_list.txt
To reduce the size of the file, the version in this repository was filtered to include only the ids that end with .jpg.
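The .jpg filtering described above could be done along these lines. This is a minimal sketch, not the code used to build the file in this repository: the function name jpg_ids is hypothetical, and it assumes each object line of the aws s3 ls output has the usual four fields (date, time, size, name).

```python
def jpg_ids(listing_lines):
    """Return the names ending in .jpg from `aws s3 ls` output lines.

    Assumes each object line looks like: <date> <time> <size> <name>.
    Lines with fewer fields (for example "PRE some/prefix/") are skipped.
    """
    ids = []
    for line in listing_lines:
        # Split into at most four fields so a name with spaces stays whole.
        parts = line.split(maxsplit=3)
        if len(parts) != 4:
            continue
        name = parts[3].strip()
        # Case-sensitive match, following the ".jpg" wording above.
        if name.endswith(".jpg"):
            ids.append(name)
    return ids


def filter_listing(in_path, out_path):
    """Read a raw listing file and write one .jpg name per line."""
    with open(in_path, encoding="utf-8") as src:
        ids = jpg_ids(src)
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(ids) + ("\n" if ids else ""))
    return len(ids)
```

For example, filter_listing("media_list_raw.txt", "media_list.txt") would write the filtered list, where media_list_raw.txt is a hypothetical name for the unfiltered command output.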
download_images.py
is a script that downloads botany images and metadata from SI Open Access on AWS. The metadata from AWS is simplified using the extract_ids function in the script and saved to metadata.tsv. The media ids from media_list.txt are used to download 2,292,004 images to a thumbnails directory. Each step is explained further in the comments in the script.
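As a rough illustration of the download step, the sketch below fetches each media id into a thumbnails directory and skips files that already exist. It is not the code in download_images.py: the names BASE_URL, media_url, and download_media are hypothetical, and the HTTPS endpoint is an assumption about how the public smithsonian-open-access bucket is exposed.

```python
import os
import urllib.request

# Assumed public HTTPS endpoint for s3://smithsonian-open-access/media/nmnh/.
BASE_URL = "https://smithsonian-open-access.s3-us-west-2.amazonaws.com/media/nmnh/"


def media_url(media_id, base_url=BASE_URL):
    """Build the download URL for one media id (assumed URL layout)."""
    return base_url + media_id


def download_media(media_ids, out_dir="thumbnails",
                   fetch=urllib.request.urlretrieve):
    """Download each media id into out_dir, skipping files already present.

    `fetch` takes (url, target_path); it is a parameter so the function
    can be exercised without network access.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for media_id in media_ids:
        target = os.path.join(out_dir, media_id)
        if not os.path.exists(target):
            fetch(media_url(media_id), target)
        saved.append(target)
    return saved
```

Skipping existing files makes it possible to restart an interrupted run, which matters for a download of this size.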