Implementation discussion - parallelization #2
Per @0x00b1's comment in #1 (comment), 24 is the number of columns in a "standard" plate (384 wells = 24 columns x 16 rows).
It would be fantastic if we could make use of the way Parquet structures data, so that reading an entire batch of data as a single Parquet dataset is trivial even if the individual plates were created separately.
It is typically called a partitioned dataset: https://arrow.apache.org/docs/python/parquet.html#reading-from-partitioned-datasets
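A minimal sketch of what such a batch-level read could look like with pyarrow, assuming the per-plate outputs share a common root directory; the directory layout and column names below are hypothetical, not the project's actual schema:

```python
# Sketch: read independently written per-plate Parquet outputs as one logical
# dataset. The directory layout and column names here are assumptions.
import pyarrow.parquet as pq

# Assume each plate was written under a shared root, e.g.
#   batch1/plate=PLATE_A/part-0.parquet, batch1/plate=PLATE_B/part-0.parquet, ...
dataset = pq.ParquetDataset("batch1/")

# The whole batch comes back as a single Arrow table.
table = dataset.read()

# Reading only selected feature columns avoids touching the rest of the data.
subset = dataset.read(columns=["Metadata_Well", "Cells_AreaShape_Area"])
```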
I think per-column sharding is sufficient, but can you tell us what tradeoffs we should consider between too much and too little sharding? Each well will typically have 2,000 to 4,000 rows (= number of cells) x 2,000 to 5,000 columns.
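For a feel of the shard sizes that question involves, here is a rough back-of-the-envelope sketch, assuming float64 feature values and the approximate row/column counts quoted above:

```python
# Back-of-the-envelope size of one per-plate-column shard (all values assumed).
wells_per_shard = 16      # one column of a 384-well plate = 16 wells
cells_per_well = 3000     # "2,000 to 4,000 rows (= number of cells)"
features = 3500           # "2,000 to 5,000 columns"
bytes_per_value = 8       # float64, before Parquet compression

shard_bytes = wells_per_shard * cells_per_well * features * bytes_per_value
print(f"~{shard_bytes / 1e9:.1f} GB uncompressed per shard")  # ~1.3 GB
```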
Great. This is merged.
The following is off-topic (not usage-related) and is getting into implementation, so feel free to ignore for now, or bump to another thread.
It currently takes ~3-5 hours to ingest a file. Our current choice of SQLite does not allow us to do parallel writes, so there's no way to parallelize this.
But now we do, because we can store the output as a Parquet dataset, which can consist of multiple files.
So for a 384-well dataset, we can save the output as a Parquet dataset with, say, 24 files (one for each column of the 384-well plate). This will also allow parallel reads, so e.g. aggregation can be faster.
Originally posted by @shntnu in #1 (comment)
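A minimal sketch of what the 24-file-per-plate sharding could look like using pyarrow's partitioned-dataset writer; the column names, the well-to-plate-column derivation, and the paths below are assumptions for illustration, not the project's actual schema:

```python
# Sketch: write one plate's single-cell table as a Parquet dataset sharded by
# plate column (24 shards for a 384-well plate). Column names are hypothetical.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "Metadata_Well": ["A01", "A02", "B01", "B02"],          # toy rows
    "Cells_AreaShape_Area": [120.0, 98.5, 143.2, 101.7],
})
# Derive the plate column ("01".."24") from the well name, e.g. "A01" -> "01".
df["Metadata_PlateColumn"] = df["Metadata_Well"].str[1:]

table = pa.Table.from_pandas(df)
# One partition directory per plate column; each could be written by an
# independent worker in a real pipeline.
pq.write_to_dataset(table, "plate1_parquet",
                    partition_cols=["Metadata_PlateColumn"])

# The whole plate reads back as a single dataset.
full = pq.ParquetDataset("plate1_parquet").read()
```

In an actual ingest pipeline each worker would presumably write its own shard directly rather than routing everything through one `write_to_dataset` call; the point here is only that the on-disk layout supports both parallel writes and parallel reads.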