-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arrow writers refactor #153
Conversation
JesseMckinzie
commented
Oct 17, 2023
- Refactor process dataset functions to be overloaded to separate csv/buffer functionality from Arrow functionality
- Remove passing status to get_arrow_table and check for nullptr instead
- Remove low level functionality from nyxus main to process functions and the environment
Apply Sweep Rules to your PR?
|
src/nyx/environment.h
Outdated
|
||
bool use_arrow = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"use_arrow" will be defined as a class member no matter if we have Apache functionality enabled or disabled. Can we have no reference to Arrow if USE_ARROW is not defined ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have no references to the Arrow variables then we need ifdef statements throughout the code whenever the Arrow functionality is used, which we want to remove. We have to choose either hiding the Arrow variables or removing the ifdef statements (unless you have an idea for another work around)
src/nyx/globals.h
Outdated
@@ -32,7 +32,8 @@ namespace Nyxus | |||
|
|||
bool scanFilePairParallel(const std::string& intens_fpath, const std::string& label_fpath, int num_fastloader_threads, int num_sensemaker_threads, int filepair_index, int tot_num_filepairs); | |||
std::string getPureFname(const std::string& fpath); | |||
int processDataset(const std::vector<std::string>& intensFiles, const std::vector<std::string>& labelFiles, int numFastloaderThreads, int numSensemakerThreads, int numReduceThreads, int min_online_roi_size, bool save2csv, bool arrow_output, const std::string& csvOutputDir); | |||
int processDataset(const std::vector<std::string>& intensFiles, const std::vector<std::string>& labelFiles, int numFastloaderThreads, int numSensemakerThreads, int numReduceThreads, int min_online_roi_size, bool save2csv, const std::string& csvOutputDir); | |||
int processDataset(const std::vector<std::string>& intensFiles, const std::vector<std::string>& labelFiles, int numFastloaderThreads, int numSensemakerThreads, int numReduceThreads, int min_online_roi_size, const std::string& outputDir); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hard to guess what function fits a CSV, memory buffer, or Apache output usage. Could we have argument "save2csv" of enum type (values "save2csv", "save2buffer", "save2arrow", "save2parquet", etc) rather than bool or at least document these functions with a comment block ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to use an enum to choose save type and added in empty classes to replace ifdef statements, as discussed with Samee. Let me know if you agree with this approach
src/nyx/main_nyxus.cpp
Outdated
theEnvironment.useCsv, | ||
theEnvironment.output_dir); | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awkward code, to my taste. Could we have a single call and use a destination type parameter or a more elegant solution? (We are in no rush with this refactor.)
src/nyx/main_nyxus.cpp
Outdated
@@ -93,6 +101,8 @@ int main (int argc, char** argv) | |||
case 3: // Memory error | |||
std::cout << std::endl << "Memory error" << std::endl; | |||
break; | |||
case 4: | |||
std::cout << std::endl << "Arrow not enabled" << std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's worth adding a couple of words how to enable, reinstall, or reset something? If Arrow libraries aren't in place, the executable/library wouldn't even run, so a user may be clueless why Arrow may still not be enabled.
src/nyx/python/new_bindings_py.cpp
Outdated
@@ -173,22 +173,36 @@ py::tuple featurize_directory_imp ( | |||
// We're good to extract features. Reset the feature results cache | |||
theResultsCache.clear(); | |||
|
|||
auto arrow_output = !pandas_output; | |||
theEnvironment.use_arrow = !pandas_output; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My traditional question here: are we supporting other Apache formats - "feather" and "parquet"? If yes, then "theEnvironment.use_arrow" should be called "theEnvironment.use_apache" ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I propose "use_apache_writers" to make it clear that the variable controls writing apache formats
src/nyx/main_nyxus.cpp
Outdated
} | ||
} | ||
else if (theEnvironment.useCsv) {return SaveOption::saveCSV;} | ||
else {return SaveOption::saveBuffer;} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in CLI mode, I don't this SaveOption::saveBuffer
is supported. It should be either arrow or csv, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, both "ARROW" and "ARROWIPC" should set saveOption to saveArrowIPC. Or may be get rid of "ARROW" option all together and be more precise.
src/nyx/arrow_output_stream.h
Outdated
#include <string> | ||
#include <memory> | ||
|
||
#include "output_writers.h" | ||
#include "helpers/helpers.h" | ||
#include "globals.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why we need this here.