Convert Human Cell Atlas Tier 1 metadata, extracted from the AnnData object of a published CELLxGENE dataset, into a spreadsheet ingestible under the HCA DCP metadata schema.
This conversion is done in 3 steps.
- Pull data from CxG: `collect_cellxgene_metadata.py`
  - Given a collection_id, select dataset and download h5ad
  - Pull obs and uns layer into csv files in the `metadata` dir, with the `<collection_id>_<dataset_id>` prefix, in `_metadata.csv` and `_study_metadata.csv` filenames
  - Test if DOI exists in ingest (ingest-token required)
- Convert to DCP spreadsheet: `convert_to_dcp.py`
  - Given a collection_id & dataset_id, pull metadata from the `metadata` dir
  - Based on `hca_template.xlsx`, use the mapping to convert to a DCP flat metadata file with DCP programmatic fields
  - Based on each field's programmatic name, the DCP spreadsheet is populated
  - Exported into an xlsx file in the `metadata` dir, to the `<collection_id>_<dataset_id>_dcp.csv` filename
- Compare previously wrangled spreadsheet vs tier 1: `compare_with_dcp.py`
  - Open the cellxgene and the previously wrangled DCP spreadsheets
  - Compare number of tabs, use intersection
  - On each common tab
    - Compare number of entities per tab
    - Compare ids per tab, for intersection
    - Compare values of entities with same IDs (except protocols)
  - Export all comparisons in a report json file in the `report_compare` dir, to the `<collection_id>_<dataset_id>_compare.json` filename
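
The comparison step above can be sketched roughly as follows. This is not the code of `compare_with_dcp.py`; it is a minimal illustration that assumes each spreadsheet has already been loaded into a nested dict of the form `{tab: {entity_id: {field: value}}}`. The function name `compare_spreadsheets`, the `skip_value_tabs` parameter (used here to leave protocol tabs out of the value comparison), the report keys and the example tab names are all hypothetical. Values are compared only on fields present in both spreadsheets, which is also an assumption.

```python
import json


def compare_spreadsheets(tier1, wrangled, skip_value_tabs=()):
    """Compare two spreadsheets given as {tab: {entity_id: {field: value}}}."""
    report = {
        "tabs": {
            "only_in_tier1": sorted(set(tier1) - set(wrangled)),
            "only_in_wrangled": sorted(set(wrangled) - set(tier1)),
        },
        "per_tab": {},
    }
    # Only tabs present in both spreadsheets are compared further.
    for tab in sorted(set(tier1) & set(wrangled)):
        t1, wr = tier1[tab], wrangled[tab]
        tab_report = {
            "n_tier1": len(t1),
            "n_wrangled": len(wr),
            "ids_only_in_tier1": sorted(set(t1) - set(wr)),
            "ids_only_in_wrangled": sorted(set(wr) - set(t1)),
            "value_diffs": {},
        }
        if tab not in skip_value_tabs:
            # Values are compared only for entities with the same ID.
            for entity_id in sorted(set(t1) & set(wr)):
                shared_fields = sorted(set(t1[entity_id]) & set(wr[entity_id]))
                diffs = {
                    field: {"tier1": t1[entity_id][field],
                            "wrangled": wr[entity_id][field]}
                    for field in shared_fields
                    if t1[entity_id][field] != wr[entity_id][field]
                }
                if diffs:
                    tab_report["value_diffs"][entity_id] = diffs
        report["per_tab"][tab] = tab_report
    return report


# Made-up example data, for illustration only.
tier1 = {
    "Donor organism": {"donor_1": {"sex": "female"}, "donor_2": {"sex": "male"}},
    "Library preparation protocol": {"lib_1": {"method": "A"}},
}
wrangled = {
    "Donor organism": {"donor_1": {"sex": "unknown"}, "donor_3": {"sex": "male"}},
    "Library preparation protocol": {"lib_1": {"method": "B"}},
    "Project": {"proj_1": {"title": "example"}},
}
report = compare_spreadsheets(
    tier1, wrangled, skip_value_tabs={"Library preparation protocol"}
)
print(json.dumps(report, indent=2))
```

Because the report is a plain dict of lists and strings, it can be written to a json file with `json.dump` without further conversion.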
Tested with Python 3.9. To run the scripts:
python3 -m pip install -r requirements.txt
python3 collect_cellxgene_metadata.py -c <CxG collection_id> -t <ingest-token>
python3 convert_to_dcp.py -c <CxG collection_id> -d <CxG dataset_id>
python3 compare_with_dcp.py -c <CxG collection_id> -d <CxG dataset_id> -w <previously wrangled spreadsheet path>
Alternatively, you can use the `wrapper_3c.py` script to run all the scripts at once for multiple collections, using a separate csv file for the IDs and the wrangled spreadsheet paths.
python3 wrapper_3c.py -i input_spreadsheet.tsv
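
As a rough illustration of what such a wrapper does, the sketch below turns each row of a tab-separated input into the three per-script command lines shown earlier. It is not the code of `wrapper_3c.py`: the function name `build_commands` and the column names `collection_id`, `dataset_id` and `wrangled_path` are assumptions, and the layout of the real input file may differ. The sketch only builds the commands and does not run them.

```python
import csv
import io


def build_commands(tsv_text, ingest_token=None):
    """Return one argument list per script call, for every row of the input.

    The column names used here are assumptions made for this sketch.
    """
    commands = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        ids = ["-c", row["collection_id"], "-d", row["dataset_id"]]
        collect = ["python3", "collect_cellxgene_metadata.py", *ids]
        if ingest_token:
            # The ingest token is only passed to the collect step.
            collect += ["-t", ingest_token]
        commands.append(collect)
        commands.append(["python3", "convert_to_dcp.py", *ids])
        commands.append(
            ["python3", "compare_with_dcp.py", *ids, "-w", row["wrangled_path"]]
        )
    return commands


# Placeholder IDs and paths, for illustration only.
example = (
    "collection_id\tdataset_id\twrangled_path\n"
    "collection-a\tdataset-a\tsheets/a.xlsx\n"
    "collection-b\tdataset-b\tsheets/b.xlsx\n"
)
for command in build_commands(example):
    print(" ".join(command))
```

Each returned list could be handed to `subprocess.run(command, check=True)` so that a failing step stops the run for that collection.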
- `--collection_id` or `-c`: CxG collection_id of the project.
  - e.g. `c353707f-09a4-4f12-92a0-cb741e57e5f0`, `dc3a5256-5c39-4a21-ac0c-4ede3e7b2323`, `20eea6c8-9d64-42c9-9b6f-c11b5249e0e9`
- `--dataset` or `-d`: Select the CxG dataset_id to download and convert.
  - e.g. `124744b8-4681-474a-9894-683896122708`, `0bae7ebf-eb54-46a6-be9a-3461cecefa4c`, `2e9d2f32-4cfb-49b5-b990-cbf4c241214e`
- `--wrangled-path` or `-w`: Path of the previously wrangled spreadsheet to compare with the spreadsheet converted from tier 1
- `--ingest-token` or `-t`: Token of ingest for collecting DOI info from ingest
- `--local_template` or `-l`: Local instance of `hca_template.xlsx`

| args | collect | convert | compare |
|---|---|---|---|
| `--collection_id`, `-c` | required | required | required |
| `--dataset_id`, `-d` | optional | optional | optional |
| `--wrangled_path`, `-w` | n/a | n/a | required |
| `--ingest_token`, `-t` | optional | n/a | n/a |
| `--local_template`, `-l` | n/a | optional | n/a |
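
The requirements in the table above could be expressed with `argparse` along these lines. This is a sketch, not the parsers used by the scripts: the names `SHORT_FLAGS`, `REQUIRED_BY_SCRIPT` and `build_parser` are hypothetical, and the long flag spellings follow the table (underscore forms), which may not match the scripts exactly.

```python
import argparse

# Short flags and per-script requirements, transcribed from the table.
SHORT_FLAGS = {
    "collection_id": "-c",
    "dataset_id": "-d",
    "wrangled_path": "-w",
    "ingest_token": "-t",
    "local_template": "-l",
}
# True means required, False means optional; "n/a" arguments are left out.
REQUIRED_BY_SCRIPT = {
    "collect": {"collection_id": True, "dataset_id": False, "ingest_token": False},
    "convert": {"collection_id": True, "dataset_id": False, "local_template": False},
    "compare": {"collection_id": True, "dataset_id": False, "wrangled_path": True},
}


def build_parser(script):
    """Build a parser that accepts only the arguments listed for one script."""
    parser = argparse.ArgumentParser(prog=script)
    for name, required in REQUIRED_BY_SCRIPT[script].items():
        parser.add_argument(f"--{name}", SHORT_FLAGS[name], required=required)
    return parser


# Placeholder values, for illustration only.
args = build_parser("compare").parse_args(
    ["-c", "collection-a", "-w", "sheets/a.xlsx"]
)
print(args.collection_id, args.dataset_id, args.wrangled_path)
```

With this layout, omitting a required argument (for example `-w` for the compare step) makes `argparse` exit with a usage error, while optional arguments default to `None`.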