HCA - Tier 1 to DCP

Convert Human Cell Atlas (HCA) Tier 1 metadata, extracted from the AnnData object of a published CELLxGENE (CxG) dataset, into a spreadsheet that follows the HCA DCP metadata schema and can be ingested.

Algorithm

This conversion is done in 3 steps, each run by its own script.

Pull data from CxG (collect_cellxgene_metadata.py)
1. Given a collection_id, select a dataset and download its h5ad file
2. Write the obs and uns layers to CSV files in the metadata dir, using the <collection_id>_<dataset_id> prefix followed by _metadata.csv and _study_metadata.csv
3. Check whether the DOI already exists in ingest (requires an ingest token)
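The file naming in step 2 can be sketched as follows. This is a minimal illustration using only the standard library, not the script's actual code: the function names are hypothetical, the inputs are plain Python objects standing in for the AnnData layers, and it assumes that obs goes to the _metadata.csv file and uns to the _study_metadata.csv file.

```python
import csv
from pathlib import Path


def metadata_paths(collection_id, dataset_id, metadata_dir="metadata"):
    """Build the two CSV paths from the <collection_id>_<dataset_id> prefix."""
    prefix = f"{collection_id}_{dataset_id}"
    base = Path(metadata_dir)
    return base / f"{prefix}_metadata.csv", base / f"{prefix}_study_metadata.csv"


def write_metadata(obs_rows, uns, collection_id, dataset_id, metadata_dir="metadata"):
    """Write obs-like rows and uns-like key/value pairs to the two CSV files.

    obs_rows: list of dicts, one per observation (stand-in for the obs layer).
    uns: dict of study-level values (stand-in for the uns layer).
    Assumption: obs -> *_metadata.csv, uns -> *_study_metadata.csv.
    """
    obs_path, study_path = metadata_paths(collection_id, dataset_id, metadata_dir)
    obs_path.parent.mkdir(parents=True, exist_ok=True)

    fieldnames = list(obs_rows[0]) if obs_rows else []
    with obs_path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(obs_rows)

    with study_path.open("w", newline="") as handle:
        # One "key,value" row per study-level entry.
        csv.writer(handle).writerows(uns.items())

    return obs_path, study_path
```

With a real h5ad file, the rows and the dict would come from the AnnData object's obs and uns attributes instead of hand-built values.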
Convert to DCP spreadsheet (convert_to_dcp.py)
1. Given a collection_id and dataset_id, read the metadata from the metadata dir
2. Based on hca_template.xlsx, use the mapping to convert the metadata into a flat DCP metadata file with DCP programmatic field names
3. Populate the DCP spreadsheet according to each field's programmatic name
4. Export the spreadsheet as an xlsx file in the metadata dir, under the <collection_id>_<dataset_id>_dcp.csv filename
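Steps 2 and 3 above can be illustrated with a small sketch. It is not the script's implementation: the mapping entries are illustrative examples rather than the repository's mapping, the function names are hypothetical, and it assumes that the first dot-separated component of a programmatic name identifies the entity (and so the tab) a field belongs to.

```python
from collections import defaultdict

# Illustrative Tier 1 -> DCP programmatic-name mapping (an assumption,
# not the mapping shipped with the repository).
TIER1_TO_DCP = {
    "donor_id": "donor_organism.biomaterial_core.biomaterial_id",
    "sample_id": "specimen_from_organism.biomaterial_core.biomaterial_id",
    "tissue_ontology_term_id": "specimen_from_organism.organ.ontology",
}


def to_flat_dcp(rows, mapping=TIER1_TO_DCP):
    """Rename mapped Tier 1 columns to programmatic names; drop unmapped ones."""
    return [
        {mapping[key]: value for key, value in row.items() if key in mapping}
        for row in rows
    ]


def group_by_tab(flat_rows):
    """Split each flat row by entity prefix and de-duplicate per entity."""
    tabs = defaultdict(list)
    for row in flat_rows:
        per_entity = defaultdict(dict)
        for name, value in row.items():
            entity = name.split(".", 1)[0]  # assumed to identify the tab
            per_entity[entity][name] = value
        for entity, fields in per_entity.items():
            if fields not in tabs[entity]:  # keep one row per distinct entity
                tabs[entity].append(fields)
    return dict(tabs)
```

For two samples from the same donor, `group_by_tab(to_flat_dcp(rows))` yields one `donor_organism` row and two `specimen_from_organism` rows, which is the shape a per-tab spreadsheet needs.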
Compare previously wrangled spreadsheet vs Tier 1 (compare_with_dcp.py)
1. Open the spreadsheet converted from CELLxGENE and the previously wrangled DCP spreadsheet
2. Compare the number of tabs and keep the intersection
3. On each common tab
  1. Compare the number of entities per tab
  2. Compare IDs per tab and keep the intersection
  3. Compare values of entities with the same IDs (except protocols)
4. Export all comparisons to a JSON report file in the report_compare dir, named <collection_id>_<dataset_id>_compare.json
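The comparison steps above can be sketched on plain dictionaries. This is a hedged illustration, not the script's code: the function name and report keys are hypothetical, each spreadsheet is assumed to be already loaded as `{tab name: {entity ID: {field: value}}}`, and protocol tabs are assumed to be recognisable by "protocol" in the tab name.

```python
import json


def compare_spreadsheets(tier1, wrangled):
    """Compare two spreadsheets given as {tab: {entity_id: {field: value}}}."""
    report = {
        "tabs": {
            "tier1_only": sorted(set(tier1) - set(wrangled)),
            "wrangled_only": sorted(set(wrangled) - set(tier1)),
        },
        "per_tab": {},
    }
    for tab in sorted(set(tier1) & set(wrangled)):  # common tabs only
        a, b = tier1[tab], wrangled[tab]
        entry = {
            "n_entities": {"tier1": len(a), "wrangled": len(b)},
            "ids_tier1_only": sorted(set(a) - set(b)),
            "ids_wrangled_only": sorted(set(b) - set(a)),
        }
        # Assumption: value comparison is skipped for protocol tabs.
        if "protocol" not in tab.lower():
            differences = {}
            for entity_id in sorted(set(a) & set(b)):
                shared_fields = sorted(set(a[entity_id]) & set(b[entity_id]))
                changed = {
                    field: {"tier1": a[entity_id][field], "wrangled": b[entity_id][field]}
                    for field in shared_fields
                    if a[entity_id][field] != b[entity_id][field]
                }
                if changed:
                    differences[entity_id] = changed
            entry["value_differences"] = differences
        report["per_tab"][tab] = entry
    return report


def report_to_json(report):
    """Serialise the report in a form suitable for a *_compare.json file."""
    return json.dumps(report, indent=2)
```

Keeping the report as a nested dict until the last step makes it straightforward to write with `json.dump` and to inspect in tests.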

Usage

Tested with Python 3.9. Install the requirements, then run the scripts in order:

python3 -m pip install -r requirements.txt
python3 collect_cellxgene_metadata.py -c <CxG collection_id> -t <ingest-token>
python3 convert_to_dcp.py -c <CxG collection_id> -d <CxG dataset_id>
python3 compare_with_dcp.py -c <CxG collection_id> -d <CxG dataset_id> -w <previously wrangled spreadsheet path>

Alternatively, use the wrapper_3c.py script to run all the scripts at once for multiple collections, with a separate input file that lists the IDs and the wrangled spreadsheet paths.

python3 wrapper_3c.py -i input_spreadsheet.tsv
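As a rough sketch of what such a wrapper could do, the snippet below turns a tab-separated input into the three command lines shown earlier. It does not reproduce wrapper_3c.py: the function name and the column names `collection_id`, `dataset_id` and `wrangled_path` are assumptions, and the real layout of input_spreadsheet.tsv may differ.

```python
import csv
import io


def build_commands(tsv_text, ingest_token=None):
    """Return one command list per script for every row of the TSV text.

    Assumed columns (hypothetical): collection_id, dataset_id, wrangled_path.
    """
    commands = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        ids = ["-c", row["collection_id"], "-d", row["dataset_id"]]

        collect = ["python3", "collect_cellxgene_metadata.py", *ids]
        if ingest_token:
            collect += ["-t", ingest_token]  # only needed for the DOI check

        commands.append(collect)
        commands.append(["python3", "convert_to_dcp.py", *ids])
        commands.append(
            ["python3", "compare_with_dcp.py", *ids, "-w", row["wrangled_path"]]
        )
    return commands
```

Each returned list can be passed to `subprocess.run(command, check=True)` so that a failing step stops the run for that collection.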

Arguments

--collection_id or -c: CxG collection_id of the project.
- e.g. c353707f-09a4-4f12-92a0-cb741e57e5f0, dc3a5256-5c39-4a21-ac0c-4ede3e7b2323, 20eea6c8-9d64-42c9-9b6f-c11b5249e0e9
--dataset or -d: CxG dataset_id to download and convert.
- e.g. 124744b8-4681-474a-9894-683896122708, 0bae7ebf-eb54-46a6-be9a-3461cecefa4c, 2e9d2f32-4cfb-49b5-b990-cbf4c241214e
--wrangled-path or -w: Path to the previously wrangled spreadsheet to compare against the spreadsheet converted from Tier 1.
- e.g. metadata/scAgingHumanMaleSkin_metadata_03-08-2023.xlsx
--ingest-token or -t: Ingest token used to collect DOI information from ingest.
--local_template or -l: Path to a local copy of hca_template.xlsx.

Requirement of arguments per script

args	collect	convert	compare
`--collection_id`, `-c`	required	required	required
`--dataset_id`, `-d`	optional	optional	optional
`--wrangled_path`, `-w`	n/a	n/a	required
`--ingest_token`, `-t`	optional	n/a	n/a
`--local_template`, `-l`	n/a	optional	n/a
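The requirement table above maps naturally onto argparse. The sketch below is one way to express it, not the parsers used by the scripts: `build_parser` is a hypothetical helper, and the long option spellings follow the table (the argument list earlier spells some of them differently), so treat them as an assumption.

```python
import argparse

# True = required, False = optional; arguments marked n/a are omitted.
REQUIREMENTS = {
    "collect": {"collection_id": True, "dataset_id": False, "ingest_token": False},
    "convert": {"collection_id": True, "dataset_id": False, "local_template": False},
    "compare": {"collection_id": True, "dataset_id": False, "wrangled_path": True},
}

SHORT_FLAGS = {
    "collection_id": "-c",
    "dataset_id": "-d",
    "wrangled_path": "-w",
    "ingest_token": "-t",
    "local_template": "-l",
}


def build_parser(script):
    """Build a parser for 'collect', 'convert' or 'compare'."""
    parser = argparse.ArgumentParser(prog=script)
    for name, required in REQUIREMENTS[script].items():
        parser.add_argument(SHORT_FLAGS[name], f"--{name}", required=required)
    return parser
```

For example, `build_parser("compare").parse_args(["-c", "<collection_id>", "-w", "<path>"])` succeeds with `dataset_id` left as `None`, while omitting `-w` makes argparse exit with a usage error.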

ebi-ait/hca-tier1-to-dcp

Folders and files

Latest commit

History

Repository files navigation

HCA - Tier 1 to DCP

Algorithm

Usage

Arguments

Requirement of arguments per script

About

Resources

Stars

Watchers

Forks

Contributors 2

Languages