Initial pass at a "prepare" step and minibatch extract #15

Merged: 7 commits, Sep 19, 2022

@kcibul (Collaborator) commented Sep 14, 2022

Would like some eyes on the code, and then I'll update the readme/workflow!

Initial scale and performance testing - 4M cell dataset

Prepare: ~1 minute and ~$1.30 to prepare 4M cells for extract.
Extract: ~5 minutes and ~$0.003 to extract 100k cells to AnnData (700 MB).

Assuming linear scaling, 1 billion cells would cost < $355 ($325 for prepare + $30 for extract).
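
The extrapolation above can be reproduced with a few lines of arithmetic. This is only a sketch of the linear-scaling assumption stated in the description; the helper name `linear_cost` is made up for illustration and is not part of the PR.

```python
def linear_cost(target_cells, measured_cells, measured_cost_usd):
    """Scale a measured cost linearly to a larger number of cells."""
    return target_cells / measured_cells * measured_cost_usd


TARGET_CELLS = 1_000_000_000

# Measured figures quoted in the PR description.
prepare_cost = linear_cost(TARGET_CELLS, 4_000_000, 1.30)   # prepare: 4M cells for ~$1.30
extract_cost = linear_cost(TARGET_CELLS, 100_000, 0.003)    # extract: 100k cells for ~$0.003
total_cost = prepare_cost + extract_cost

print(f"prepare: ${prepare_cost:.2f}")
print(f"extract: ${extract_cost:.2f}")
print(f"total:   ${total_cost:.2f}")
```

The measured inputs are themselves approximate, so the total is an estimate rather than a bound.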

Comment on lines 48 to 50
features = []
for row in query:
    features.append(Feature(row["cas_feature_index"], row["feature_name"]))
Contributor

TOL: the comprehension version of this logic on line 67 is nice... 🙂

Collaborator Author

so it is :D
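
For reference, the comprehension form being discussed might look like the sketch below. The `Feature` namedtuple and the sample `query` rows are stand-ins invented for illustration; the real `Feature` class and query result type live in the PR's source and are not shown in this thread.

```python
from collections import namedtuple

# Hypothetical stand-in for the PR's Feature class (assumed to take index, then name).
Feature = namedtuple("Feature", ["index", "name"])

# Hypothetical rows; in the PR these come from iterating a query result.
query = [
    {"cas_feature_index": 0, "feature_name": "gene_a"},
    {"cas_feature_index": 1, "feature_name": "gene_b"},
]

# Loop-and-append rewritten as a single list comprehension.
features = [Feature(row["cas_feature_index"], row["feature_name"]) for row in query]
```

Both forms build the same list; the comprehension simply avoids the empty-list setup and the explicit `append`.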

src/casp/extract_minibatch_to_anndata.py: 5 outdated review threads (resolved)
src/casp/prepare_for_training.py (outdated)
CAST(FLOOR(rand() * (select count(1) from `{project}.{dataset}.cas_cell_info`) / {extract_bin_size}) as INT) as extract_bin
FROM `{project}.{dataset}.cas_cell_info` c
ORDER BY cas_cell_index
"""
Contributor

This seems to produce a nice set of roughly equally sized bins, except for that last one...

Collaborator Author

I can add a comment about this, but there will be a "remainder" bin that is smaller than the rest. There are other approaches, but this is a reasonable place to start for now, and as we learn more about the ML requirements we can optimize, i.e., how important is it for the bins to be exactly the same size vs. the size you asked for, etc.
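
The "remainder" bin behavior can be seen in a small simulation. The sketch below mimics the `FLOOR(rand() * count / extract_bin_size)` expression in plain Python; the function names, the seed, and the cell counts are illustrative assumptions. `assign_bins_even` is one possible alternative (not something proposed in this PR) that rounds the bin count up so every bin has the same expected size, at the cost of bins slightly smaller than requested.

```python
import math
import random
from collections import Counter


def assign_bins(n_cells, extract_bin_size, seed=0):
    """Mimic FLOOR(rand() * n_cells / extract_bin_size) for each cell."""
    rng = random.Random(seed)
    scale = n_cells / extract_bin_size  # fractional when n_cells is not a multiple
    return [math.floor(rng.random() * scale) for _ in range(n_cells)]


def assign_bins_even(n_cells, extract_bin_size, seed=0):
    """Alternative (an assumption, not from the PR): use a whole number of bins."""
    rng = random.Random(seed)
    n_bins = math.ceil(n_cells / extract_bin_size)
    return [math.floor(rng.random() * n_bins) for _ in range(n_cells)]


# 10,500 cells with a requested bin size of 1,000 gives 10.5 "bins":
# bins 0-9 each expect ~1,000 cells, bin 10 expects ~500.
bin_counts = Counter(assign_bins(10_500, 1_000))
even_counts = Counter(assign_bins_even(10_500, 1_000))

for b in sorted(bin_counts):
    print(b, bin_counts[b], even_counts[b])
```

Because assignment is random, bin sizes are only approximately equal in both variants; exact sizes would need a different approach, such as ranking cells by a random key and dividing the rank by the bin size.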

@kcibul kcibul merged commit e0d6a06 into main Sep 19, 2022
@kcibul kcibul deleted the kc_prep branch September 19, 2022 15:45