Process data #8

Merged: 25 commits, Nov 10, 2019

@wagenrace (Collaborator) commented Nov 5, 2019

  • Downloads the zip archive from the URL.
  • Stores the zip on the host computer.
  • Loads the training and validation CSV files straight from the zip.
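
As a rough illustration of the last step (not the code in this PR), the sketch below reads CSV files directly from a zip archive without extracting it. The function name `load_csv_from_zip` and the member names `training.csv` and `validation.csv` are hypothetical, and the standard-library `csv` module stands in for pandas so the sketch has no dependencies.

```python
import csv
import io
import os
import tempfile
import zipfile


def load_csv_from_zip(zip_path, member):
    """Return the rows of one CSV file stored inside a zip archive.

    The archive is read in place; nothing is extracted to disk.
    (Hypothetical helper, not the function used in this PR.)
    """
    with zipfile.ZipFile(zip_path) as archive:
        # ZipFile.open yields a binary stream; wrap it so csv sees text.
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


if __name__ == "__main__":
    # Build a tiny archive so the example runs without any download.
    with tempfile.TemporaryDirectory() as folder:
        zip_path = os.path.join(folder, "data.zip")
        with zipfile.ZipFile(zip_path, "w") as archive:
            archive.writestr("training.csv", "id,label\n1,a\n2,b\n")
            archive.writestr("validation.csv", "id,label\n3,c\n")

        training = load_csv_from_zip(zip_path, "training.csv")
        validation = load_csv_from_zip(zip_path, "validation.csv")
        print(len(training), "training rows,", len(validation), "validation rows")
```

With pandas installed, `pandas.read_csv` accepts the file object returned by `ZipFile.open` in the same way and returns a DataFrame instead of a list of dictionaries.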

Added dependencies:

  • pandas

Issue #4
I performed the following prior to filing this pull request:

  • Tested that my change does not break the analysis pipeline
  • Ran a linter through my code
  • Updated environment dependencies if my code introduces a new package

@gwaybio (Contributor) left a comment

Thanks for the PR, @wagenrace. A few important comments below:

  1. This code downloads the data from figshare; it does not process the data for upload to figshare. In Adding Data Preparation Code #4, I am specifically interested in adding the code that @fheigwer used to process the data that was presumably downloaded from IDR (DOI: http://doi.org/10.17867/10000101). I think changing the naming conventions would make this much clearer. Perhaps instead of 0.process-data, that module should be named 0.process-idr-images. To merge this PR, please rename the module to 1.download-data.
  2. I am thinking that the function downloadData() in main.py will never be used outside the 1.download-data module, so let's put this file in a folder called 1.download-data/scripts.
  3. I was probably not being 100% clear, but in an analysis-module repository style, each folder stands alone. Therefore, there should be no need to import functions across folders, and the current file 0process-data/__init__.py is not necessary.
  4. The current file 0process-data/main.py is the code used to download the data. So anytime someone wanted access to the csv files, they would have to re-download. This module should be run only once by an external user. For example, the user will run some command that calls main.py and the data will be placed in the appropriate folders (see Adding Data Download Code #5).
  5. Please add execution instructions. This could be in the form of a Jupyter notebook that, when run, downloads the data.

@wagenrace (Collaborator, Author) commented

> 4. The current file `0process-data/main.py` is the code used to download the data. So anytime someone wanted access to the csv files, they would have to re-download. This module should be run only once by an external user. For example, the user will run some command that calls `main.py` and the data will be placed in appropriate folders (see #5)

About point 4: why is unzipping needed? The zipped files can be loaded directly into Python and unzipped there, the same way it is done with the MNIST dataset.
It will not re-download. There are checks to prevent that from happening.
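
For illustration only, a check of this kind could look like the sketch below. It is not the code from this PR: the function name `download_once` and the `.part` temporary suffix are assumptions, and the only test applied is whether the destination file already exists.

```python
import urllib.request
from pathlib import Path


def download_once(url, destination):
    """Download url to destination unless the file is already there.

    Returns True if a download happened and False if the existing
    file was reused. (Hypothetical helper, not the PR's function.)
    """
    destination = Path(destination)
    if destination.exists():
        # Existing copy found: skip the network entirely.
        return False
    destination.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted transfer
    # never leaves a partial file that looks like a finished one.
    partial = destination.with_name(destination.name + ".part")
    urllib.request.urlretrieve(url, str(partial))
    partial.replace(destination)
    return True
```

Calling `download_once(url, "data/data.zip")` twice would fetch the archive on the first call and return `False` on the second; the path shown here is a placeholder.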

@gwaybio (Contributor) left a comment

Thanks for the work, @wagenrace. We are nearly there.

@gwaybio (Contributor) commented Nov 10, 2019

Thanks for the important PR @wagenrace - I will go ahead and merge so we can continue with the project 👍

@gwaybio gwaybio merged commit 14b63aa into cytodata:master Nov 10, 2019