Create sample dataset #72

davidackerman · 2024-02-13T20:32:50Z

We need a sample dataset in a predefined format for testing and demos.

rhoadesScholar · 2024-02-14T17:40:19Z

Locally generated small datasets, e.g. raw images containing a large X or two (drawn in pixels), with the GT being an inverted version of the raw.
@d-v-b Pull a couple of training crops from s3://hela-2, plus the raw data around those crops with some generous padding, and store them locally for convenient use.
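The locally generated option could look roughly like the sketch below. It is only an illustration: the function name `make_x_dataset`, the 2D shape, the uint8 dtype and the line thickness are all assumptions, not something agreed on in this thread.

```python
import numpy as np


def make_x_dataset(size: int = 64, thickness: int = 3):
    """Return (raw, gt): a square uint8 image with a bright X, and its inverse.

    Hypothetical helper for illustration; size and thickness are arbitrary.
    """
    rows, cols = np.indices((size, size))
    # Pixels close to either diagonal form the two strokes of the X.
    main_diag = np.abs(rows - cols) < thickness
    anti_diag = np.abs(rows + cols - (size - 1)) < thickness
    raw = np.where(main_diag | anti_diag, 255, 0).astype(np.uint8)
    # The ground truth is simply the inverted raw image.
    gt = 255 - raw
    return raw, gt


raw, gt = make_x_dataset()
```

A dataset like this needs no network access, so it would suit unit tests, while the S3 crops suit demos on real data.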

d-v-b · 2024-02-14T19:32:34Z

which crops, how much padding, and where should it be saved?

rhoadesScholar · 2024-02-15T15:53:21Z

so far for the small datasets to pull from s3:// via a script:

raw: data/jrc_hela-2/jrc_hela-2.zarr/recon-1/em/fibsem-uint8 (with separate crops for each GT cube)
validation (converted to separate arrays):
- data/jrc_hela-2/staging/groundtruth.zarr/crop113/all
- data/jrc_hela-2/staging/groundtruth.zarr/crop155/all
train (converted to separate arrays):
- ...

avweigel · 2024-02-15T16:05:13Z

@d-v-b @yuriyzubov
the datasets that we want available on s3:// are all currently on our nrs. we want to include jrc_hela-2.zarr and the crops that are in /staging/groundtruth.zarr

the general format should follow our schema

em data: jrc_hela-2.zarr/recon-1/em/...
crop data: jrc_hela-2.zarr/recon-1/labels/groundtruth/...

explicit list of crops to be included:
crop1
crop3
crop4
crop6
crop7
crop8
crop9
crop13
crop14
crop15
crop16
crop18
crop19
crop23
crop28
crop54
crop55
crop56
crop57
crop58
crop59
crop94
crop95
crop96
crop113
crop155

d-v-b · 2024-02-15T16:36:35Z

upload crops to s3 from hela2/staging
a script will download from s3 a skeleton raw volume that contains em data only in the areas covered by crops, plus 256 of padding on each dimension
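As a rough sketch of the padding step, the helper below expands a crop's bounding box and clips it to the raw volume. It assumes "256 on each dimension" means 256 voxels on each side of every axis, and that crops are described by a voxel offset and shape; `padded_roi` and its parameters are hypothetical names, not taken from the gist.

```python
def padded_roi(offset, shape, volume_shape, padding=256):
    """Expand a crop's bounding box by `padding` voxels per side, clipped to the volume.

    offset, shape: crop origin and size in voxels, one entry per axis.
    volume_shape: shape of the full raw volume in voxels.
    Returns a tuple of slices that can index the raw array.
    """
    start = tuple(max(0, o - padding) for o in offset)
    stop = tuple(
        min(v, o + s + padding) for o, s, v in zip(offset, shape, volume_shape)
    )
    return tuple(slice(a, b) for a, b in zip(start, stop))


# Example with made-up numbers: the box is clipped at the lower bound of
# axis 0 and at the upper bound of axis 2.
roi = padded_roi((100, 500, 1000), (200, 200, 200), (1000, 2000, 1300))
```

Only the regions returned by such a function would need to be copied; everything else in the local raw array can stay empty.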

d-v-b · 2024-02-15T16:36:55Z

also, replace the current jrc_hela-2.zarr on s3

d-v-b · 2024-02-16T14:20:20Z

the data on s3 is now correct (i.e., the s3://janelia-cosem-datasets/jrc_hela-2/jrc_hela-2.zarr/recon-1/em/fibsem-uint8 and s3://janelia-cosem-datasets/jrc_hela-2/jrc_hela-2.zarr/recon-1/labels/groundtruth are populated). I started a command-line tool for copying the right data locally; i uploaded it as a gist which you can find here: https://gist.github.com/d-v-b/6dc1ae079b664711061490ba4b866c6c.

obviously this will eventually need to a) do all the things it's supposed to do, and b) be integrated into dacapo. but I don't think I can do either of those things today. @yuriyzubov (or anyone else), if you want to hack on this script feel free, just let me know, so that we can avoid duplicated effort. Specifically, if it becomes part of dacapo, please link that PR or commit to this issue so I know about it. Otherwise I can finish it up over the weekend.

d-v-b · 2024-02-19T22:24:54Z

@avweigel two of the crops in this list overlap (6 and 113), is that OK?
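For reference, an overlap like this can be detected with a simple axis-aligned bounding-box test. The sketch below assumes each crop is described by a voxel offset and shape; the function name is hypothetical and the numbers in the example are made up, not the real coordinates of crops 6 and 113.

```python
def boxes_overlap(offset_a, shape_a, offset_b, shape_b):
    """Return True if two axis-aligned boxes share at least one voxel.

    Boxes that only touch at a face are not counted as overlapping.
    """
    return all(
        oa < ob + sb and ob < oa + sa
        for oa, sa, ob, sb in zip(offset_a, shape_a, offset_b, shape_b)
    )


# Made-up boxes: the first pair intersects, the second pair only touches.
print(boxes_overlap((0, 0, 0), (10, 10, 10), (5, 5, 5), (10, 10, 10)))
print(boxes_overlap((0, 0, 0), (10, 10, 10), (10, 0, 0), (5, 5, 5)))
```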

d-v-b · 2024-02-19T22:30:18Z

I updated the gist with a fully-functioning script. it's pretty slow -- running it took several hours on my workstation -- but it does work. If the crappy performance is a problem, we can explore some performance optimizations. I am already doing some parallelism, but it's pretty coarse-grained and could surely benefit from some tooling.
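One possible direction for finer-grained parallelism is to copy individual chunks through a thread pool, since the work is mostly network-bound. This is a generic sketch and not what the gist does: `copy_chunks` is a hypothetical helper, and `source` and `dest` are assumed to be dict-like chunk stores keyed by chunk name.

```python
from concurrent.futures import ThreadPoolExecutor


def copy_chunks(source, dest, keys, max_workers=16):
    """Copy source[key] to dest[key] for every key, using a thread pool.

    source, dest: dict-like stores mapping chunk keys to bytes (assumption).
    Returns the keys in the order they were submitted.
    """

    def copy_one(key):
        dest[key] = source[key]
        return key

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and re-raises any worker exception.
        return list(pool.map(copy_one, keys))


# In-memory stand-ins for a remote and a local store.
remote = {"0.0.0": b"a", "0.0.1": b"b", "0.1.0": b"c"}
local = {}
copied = copy_chunks(remote, local, list(remote))
```

Threads fit here because each task spends most of its time waiting on I/O; whether this actually helps would need to be measured against the current script.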

@rhoadesScholar if I wanted this script to be added to dacapo, where would we put it in the source tree?

rhoadesScholar · 2024-03-13T17:53:44Z

@d-v-b The idea was to put it in the examples folder. But perhaps this should be done with a more minimal list of crops to speed things up. I imagine users might start getting frustrated after 5+ minutes if they're just trying to run an example notebook. 😬

Are you downloading the whole scale pyramids? Because that would explain a lot of the slowness, and they could be safely omitted for simple example cases imo.

d-v-b · 2024-03-14T09:47:58Z

I will see how things run with a reduced number of crops + only downloading s0, and I will open a PR that actually adds the script to dacapo in the examples folder.

mzouink · 2024-03-14T13:23:20Z

Sorry for jumping in late.
I think the goal is to have something similar to TensorFlow and PyTorch.
TensorFlow:

import tensorflow_datasets as tfds
ds = tfds.load('mnist', split='train', shuffle_files=True)

PyTorch:

from torchvision import datasets
from torchvision.transforms import ToTensor  # needed for the transform below
training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

DaCapo Example:

import dacapo_datasets as dds
datasplit = dds.HeLaCell(
    path="/path")

rhoadesScholar · 2024-03-14T13:27:18Z

I think dacapo.datasets.HelaCell would be an awesome entry point.

mzouink · 2024-03-14T13:30:37Z

If we want to give them the best experience, I would recommend that the hello-world example fine-tune the setup04 model.
To do this, I would recommend pulling GT from s1 (8nm) and raw from s2 (16nm, because we are using an upsample unet).
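The mapping between resolution and scale level mentioned above (s1 is 8nm, s2 is 16nm) can be sketched as a small helper. It assumes each level downsamples by a factor of 2 and that s0 is therefore 4nm, which is inferred from those two values rather than stated in this thread; `scale_level` is a hypothetical name.

```python
import math


def scale_level(resolution_nm, base_nm=4.0):
    """Map a voxel size in nm to a scale-level name such as "s1".

    Assumes s0 has voxel size `base_nm` and each level doubles it.
    Raises ValueError if the resolution is not on the pyramid.
    """
    ratio = resolution_nm / base_nm
    level = round(math.log2(ratio))
    if level < 0 or not math.isclose(2**level, ratio):
        raise ValueError(f"{resolution_nm} nm is not a level of this pyramid")
    return f"s{level}"


gt_level = scale_level(8)    # GT at 8nm
raw_level = scale_level(16)  # raw at 16nm
```

A downloader that takes resolutions in nm could use such a helper to decide which single pyramid level to fetch instead of pulling every level.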

rhoadesScholar · 2024-03-14T13:31:31Z

from dacapo import datasets
training_data = datasets.HelaCell(
    download=True, # download data (instead of training from cloud, which isn't implemented yet)
    root="data", # download to folder "./data/"
    raw_scale=8, # download the 8nm raw data
    gt_scale=4, # download the 4nm GT data
)

d-v-b · 2024-03-14T13:44:11Z

what's the type of training_data here?

mzouink · 2024-03-14T13:47:08Z

It is a DaCapo DataSplit
https://github.com/janelia-cellmap/dacapo/blob/datasplit_generator/dacapo/experiments/datasplits/datasplit.py
If you can help me with the downloader from S3, I can wrap the data using
https://github.com/janelia-cellmap/dacapo/blob/datasplit_generator/dacapo/experiments/datasplits/datasplit_generator.py

rhoadesScholar · 2024-07-11T20:03:07Z

Revisiting this @d-v-b and @yuriyzubov. We need a script to essentially do this for the segmentation challenge as well.

rhoadesScholar assigned d-v-b and yuriyzubov Feb 14, 2024

rhoadesScholar assigned avweigel Feb 15, 2024
