Initial pass at a "prepare" step and minibatch extract #15

kcibul · 2022-09-14T00:59:40Z

Would like some eyes on the code, and then I'll update the readme/workflow!

Initial scale and performance testing - 4m cell dataset

Prepare: ~1 minute to prepare 4M cells for extract, costing ~$1.30.
Extract: ~5 minutes to extract 100k cells to AnnData (700 MB), costing ~$0.003.

Assuming linear scaling, 1 billion cells would cost < $355 ($325 to prepare + $30 to extract).
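
The projection can be checked with a few lines of arithmetic. This is a sketch that assumes both steps scale strictly linearly with cell count; the constant and function names are illustrative, and the inputs are the approximate figures from the 4M-cell test run.

```python
# Approximate figures from the 4M-cell test run.
PREPARE_COST_USD = 1.30    # cost to prepare 4M cells
PREPARE_CELLS = 4_000_000
EXTRACT_COST_USD = 0.003   # cost to extract one 100k-cell minibatch
EXTRACT_CELLS = 100_000


def projected_cost(total_cells: int) -> tuple[float, float]:
    """Return (prepare_cost, extract_cost) in USD, assuming linear scaling."""
    prepare = PREPARE_COST_USD * total_cells / PREPARE_CELLS
    extract = EXTRACT_COST_USD * total_cells / EXTRACT_CELLS
    return prepare, extract


prepare, extract = projected_cost(1_000_000_000)
print(f"prepare ~${prepare:.0f}, extract ~${extract:.0f}, total ~${prepare + extract:.0f}")
```

The extract figure assumes every cell is extracted exactly once, in 100k-cell minibatches.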

mcovarr · 2022-09-14T21:08:56Z

src/casp/extract_minibatch_to_anndata.py

+    features = []
+    for row in query:
+        features.append(Feature(row["cas_feature_index"], row["feature_name"]))


TOL the comprehension version of this logic on line 67 is nice... 🙂

so it is :D
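
For reference, the append loop in the diff collapses into a single list comprehension. The sketch below is self-contained: `Feature` is a stand-in `namedtuple` with assumed field names, and `query` is a hard-coded list of dicts standing in for the query results. Neither is taken from the PR's actual definitions.

```python
from collections import namedtuple

# Hypothetical stand-in for the Feature type used in the PR; field names are assumed.
Feature = namedtuple("Feature", ["index", "name"])

# Hypothetical rows standing in for the query results.
query = [
    {"cas_feature_index": 0, "feature_name": "gene_a"},
    {"cas_feature_index": 1, "feature_name": "gene_b"},
]

# Comprehension equivalent of the append loop.
features = [Feature(row["cas_feature_index"], row["feature_name"]) for row in query]
```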

mcovarr · 2022-09-15T00:03:24Z

src/casp/prepare_for_training.py

+                CAST(FLOOR(rand() * (select count(1) from `{project}.{dataset}.cas_cell_info`) / {extract_bin_size}) as INT) as extract_bin
+        FROM `{project}.{dataset}.cas_cell_info` c
+        ORDER BY cas_cell_index
+    """


This seems to produce a nice set of roughly equally sized bins, except for that last one...

I can add a comment about this, but there will be a "remainder" bin that is smaller than the rest. There are other approaches, but this is a reasonable place to start for now, and as we learn more about the ML requirements we can optimize, i.e. how important it is for the bins to be exactly the same size vs. the size you asked for, etc.
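
To make the remainder bin concrete, the following sketch mimics the `FLOOR(rand() * count / extract_bin_size)` expression in plain Python. It is a simulation under assumed numbers (10,500 cells, bins of 1,000), not the BigQuery code from the PR, and `assign_bins` is a hypothetical helper name.

```python
import math
import random
from collections import Counter


def assign_bins(num_cells: int, bin_size: int, seed: int = 0) -> Counter:
    """Assign each cell a random bin as floor(rand() * num_cells / bin_size)."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    scale = num_cells / bin_size  # e.g. 10.5 -> bin ids 0..10
    return Counter(math.floor(rng.random() * scale) for _ in range(num_cells))


counts = assign_bins(num_cells=10_500, bin_size=1_000)
for bin_id in sorted(counts):
    print(bin_id, counts[bin_id])
```

Bins 0 through 9 each cover a full unit of the scaled random range and so hold about 1,000 cells, while bin 10 covers only half a unit and holds about 500. Sizes are approximate rather than exact because assignment is random.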

kcibul added 4 commits September 12, 2022 20:42

WIP

2a7d1a9

initial prepare and minibatch extract

8423e21

linting

f4e6893

grr... linting...

a47e997

kcibul assigned mcovarr Sep 14, 2022

mcovarr reviewed Sep 15, 2022

kcibul and others added 3 commits September 19, 2022 10:38

Apply suggestions from code review

a71a6ed

Accepted suggestions Co-authored-by: Miguel Covarrubias <[email protected]>

PR comments and support for restricting to allow list

5561b4f

linting and comments

f25c575

mcovarr approved these changes Sep 19, 2022

kcibul merged commit e0d6a06 into main Sep 19, 2022

kcibul deleted the kc_prep branch September 19, 2022 15:45
