Process data #8

wagenrace · 2019-11-05T19:46:58Z

Downloads the zip from the URL.
Stores the zip on the host computer.
Loads the training and validation csv files straight from the zip.
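
A minimal sketch of those three steps, not the code in this PR. The function names `download_zip` and `read_csv_from_zip` are hypothetical, and the sketch uses only the standard library so it runs without extra packages. The PR itself depends on pandas, and `pandas.read_csv` accepts the same open file handle if a DataFrame is preferred.

```python
import csv
import io
import shutil
import urllib.request
import zipfile
from pathlib import Path


def download_zip(url, destination):
    """Download `url` and store the archive at `destination` on the host."""
    destination = Path(destination)
    destination.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination


def read_csv_from_zip(zip_path, member):
    """Read one csv member into a list of row dicts without extracting it."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            # The archive member is a binary stream; wrap it to decode text.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))
```

The member name passed to `read_csv_from_zip` (for example a training or validation csv) depends on how the archive is laid out, which this sketch does not assume.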

Add dependencies:

Pandas

Issue #4
I performed the following prior to filing this pull request:

Tested that my change does not break the analysis pipeline
Ran a linter through my code
Updated environment dependencies if my code introduces a new package

gwaybio

Thanks for the PR @wagenrace - a couple of important comments below:

1. This code downloads the data from figshare; it does not process the data for upload to figshare. In Adding Data Preparation Code #4, I am specifically interested in adding the code that @fheigwer used to process the data that was presumably downloaded from IDR (DOI: http://doi.org/10.17867/10000101). I think changing the naming conventions would make this much clearer. Perhaps instead of 0.process-data it should be 0.process-idr-images. To merge this PR, please change the module to 1.download-data.
2. I think the function downloadData() in main.py will never be used outside the 1.download-data module, so let's put this file in a folder called 1.download-data/scripts.
3. I was probably not 100% clear, but in an analysis module repository style, each folder stands alone. Therefore, there should be no need to import functions across folders, and the current file 0process-data/__init__.py is not necessary.
4. The current file 0process-data/main.py is the code used to download the data, so anytime someone wanted access to the csv files, they would have to re-download. This module should be run only once by an external user. For example, the user will run some command that calls main.py and the data will be placed in appropriate folders (see Adding Data Download Code #5).
5. Please add execution instructions. This could be in the form of a Jupyter notebook that, when run, will download the data.
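
One way the "run some command" request could look is a small command-line entry point. This is only a sketch: the `--downloadLocation` option name follows a later commit message in this PR, while its default value, the help text, and the `build_parser` and `main` names are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build the command-line parser for the download script."""
    parser = argparse.ArgumentParser(
        description="Download the dataset once into a local data folder."
    )
    parser.add_argument(
        "--downloadLocation",
        default="data",  # assumed default, not taken from the PR
        help="Folder in which the downloaded zip is stored.",
    )
    return parser


def main(argv=None):
    """Parse arguments; a real script would start the download here."""
    args = build_parser().parse_args(argv)
    print(f"Data would be stored in: {args.downloadLocation}")
    return args
```

A script would typically call `main()` under the usual `if __name__ == "__main__":` guard, so that a user runs it once, for example `python download_data.py --downloadLocation data`.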

.gitignore

0process-data/download_data/main.py

wagenrace · 2019-11-05T20:53:08Z

4. The current file `0process-data/main.py` is the code used to download the data. So anytime someone wanted access to the csv files, they would have to re-download. This module should be run only once by an external user. For example, the user will run some command that calls `main.py` and the data will be placed in appropriate folders (see #5)

About point 4: why is unzipping needed? The zipped files can be loaded directly into Python and unzipped there, the same way it is done with the MNIST dataset.
It will not re-download; there are checks that prevent that from happening.
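
A check of this kind might look like the sketch below. It is not the code in this PR: the names `sha256_of` and `needs_download` are hypothetical, and SHA-256 is an assumption, since the thread only mentions that a hash check was added later without naming the algorithm.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(zip_path, expected_hash=None):
    """Return True if the zip is missing or fails the optional hash check."""
    zip_path = Path(zip_path)
    if not zip_path.is_file():
        return True  # nothing stored yet
    if expected_hash is None:
        return False  # file exists and no hash was supplied to compare
    return sha256_of(zip_path) != expected_hash
```

The download step would then run only when `needs_download(...)` returns `True`, so repeated calls reuse the stored zip.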

gwaybio

Thanks for the work @wagenrace - we are nearly there

.gitignore

1.download-data/README.md

1.download-data/downloadData.py

1.download-data/scripts/download_data.py


gwaybio · 2019-11-10T15:21:00Z

Thanks for the important PR @wagenrace - I will go ahead and merge so we can continue with the project 👍

Tom Nijhof added 5 commits November 5, 2019 18:51

add random forest using python as a baseline

da324bf

Remove code again

031771c

add download data from url

ea6ca46

add data generator with csv files

0eee930

format with black

100158e

gwaybio requested changes Nov 5, 2019

Tom Nijhof added 4 commits November 5, 2019 22:06

move to 1.download-data

3bac3d3

minimize gitignore to only include idle and normal python stuff

641cf51

add readme

1ebf9f5

add parser for url

e2ec9fb

wagenrace mentioned this pull request Nov 5, 2019: Download data #9 (closed, 3 tasks)

formated with black

cb88cdc

gwaybio requested changes Nov 8, 2019

Tom Nijhof added 15 commits November 8, 2019 16:14

remove package file from gitignore for pr cytodata#8

29f14a2

new line after header

8cc57fb

rename zip file

786adaf

add new name as default

c965436

update url

73c96da

remove defaults from function and use the one of parser

dc86348

only ignore .zip files in the data folder

c81e284

stimulate users to use the same data folder

55be92a

Hard code download url

22abd75

bugfix: downloading url

5767e2d

set the data folder in the main root

6f874ed

add the hash check of @gwaygenomics

6638858

format to pep8 import standards

1858304

update help of argument downloadLocation

3d41e78

also update this to follow pep8 import sort

eba8be3

gwaybio approved these changes Nov 10, 2019

gwaybio merged commit 14b63aa into cytodata:master Nov 10, 2019
