Initial pass at a "prepare" step and minibatch extract #15
features = []
for row in query:
    features.append(Feature(row["cas_feature_index"], row["feature_name"]))
TOL the comprehension version of this logic on line 67 is nice... 🙂
so it is :D
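For reference, a minimal sketch of the comprehension form mentioned above. `Feature` here is a hypothetical namedtuple stand-in and the rows are made-up sample data; the real class and query result in this PR may look different.

```python
from collections import namedtuple

# Hypothetical stand-in for the Feature type used in this PR (assumption).
Feature = namedtuple("Feature", ["index", "name"])

# Made-up rows standing in for the BigQuery result (assumption).
query = [
    {"cas_feature_index": 0, "feature_name": "gene_a"},
    {"cas_feature_index": 1, "feature_name": "gene_b"},
]

# Same logic as the append loop, written as a list comprehension.
features = [
    Feature(row["cas_feature_index"], row["feature_name"]) for row in query
]
```

Both forms build the same list; the comprehension just drops the explicit `append`.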
CAST(FLOOR(rand() * (select count(1) from `{project}.{dataset}.cas_cell_info`) / {extract_bin_size}) as INT) as extract_bin
FROM `{project}.{dataset}.cas_cell_info` c
ORDER BY cas_cell_index
"""
I can add a comment about this, but there will be a "remainder" bin that is smaller than the rest. There are other approaches, but this is a reasonable place to start for now, and as we learn more about the ML requirements we can optimize, i.e. how important it is for every bin to be exactly the same size vs. the size you asked for, etc.
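To illustrate the "remainder" bin, here is a rough Python simulation of the `FLOOR(rand() * count / extract_bin_size)` expression. It is only a sketch: the function name, the seed, and the cell counts are assumptions chosen for the example, not code from this PR.

```python
import math
import random
from collections import Counter


def assign_extract_bins(num_cells, extract_bin_size, seed=0):
    """Mimic FLOOR(rand() * num_cells / extract_bin_size) for every cell."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    scale = num_cells / extract_bin_size
    return [math.floor(rng.random() * scale) for _ in range(num_cells)]


# 10,000 cells with a requested bin size of 4,000 gives scale = 2.5,
# so bins 0 and 1 each cover a full unit and bin 2 covers only half.
bins = assign_extract_bins(num_cells=10_000, extract_bin_size=4_000)
bin_counts = Counter(bins)
for bin_id in sorted(bin_counts):
    print(bin_id, bin_counts[bin_id])
```

Because assignment is random, the full bins land near the requested size rather than exactly on it, and the last bin gets only the fractional share.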
Accepted suggestions Co-authored-by: Miguel Covarrubias <[email protected]>
Would like some eyes on the code, and then I'll update the readme/workflow!
Initial scale and performance testing - 4M cell dataset
Prepare: ~1 minute to prepare 4M cells for extract, costing ~$1.30.
Extract: ~5 minutes to extract 100k cells to AnnData (700 MB), costing ~$0.003.
Assuming linear scaling, 1 billion cells would cost < $355 ($325 for prepare + $30 for extract).
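The extrapolation can be reproduced with a few lines of arithmetic. This sketch assumes cost scales strictly linearly with cell count, which has not been verified beyond the 4M cell test; the function and constant names are made up for the example.

```python
# Measured figures from the 4M cell test above.
PREPARE_COST_USD = 1.30       # to prepare 4,000,000 cells
PREPARE_CELLS = 4_000_000
EXTRACT_COST_USD = 0.003      # to extract 100,000 cells
EXTRACT_CELLS = 100_000


def projected_cost(target_cells):
    """Linearly scale the measured prepare and extract costs (assumption)."""
    prepare = target_cells / PREPARE_CELLS * PREPARE_COST_USD
    extract = target_cells / EXTRACT_CELLS * EXTRACT_COST_USD
    return prepare, extract, prepare + extract


prepare_usd, extract_usd, total_usd = projected_cost(1_000_000_000)
print(f"prepare ${prepare_usd:.2f} + extract ${extract_usd:.2f} = ${total_usd:.2f}")
```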