-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allowing to compute lambda fitting on series and stop dropping PM if it is in the REE pattern #100
Allowing to compute lambda fitting on series and stop dropping PM if it is in the REE pattern #100
Conversation
Pull Request Test Coverage Report for Build 7845529512Details
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @mmuesly, thanks again for the PR! I've had a look through these change, and added some comments here around the dataframe/series accessors.
For the lambdas side of things, I had a look at a few changes to adapt existing functions to achieve the same result, without adding much code/extra functions:
- The two helper functions can be condensed into one which has a switch for series/dataframes.
- For each of the algorithms/methods, I've made a few small modifications which allow them to be used with series, together with the helper function to wrap everything together. This makes it a bit easier to maintain and in most cases gives fewer places in the docs to go chasing information.
If OK with you I could push them to your branch, or otherwise post some of it here and we can discuss/edit after merging?
After this is merged I'll also add a few tests to pass series to these functions, and down the line perhaps modify the pyrolite.util.lambdas.calc_lambdas()
function to deal with series too.
pyrolite/geochem/__init__.py
Outdated
@@ -44,8 +44,10 @@ def list_elements(self): | |||
------- | |||
The list will have the same ordering as the source DataFrame. | |||
""" | |||
fltr = self._obj.columns.isin(_common_elements) | |||
return self._obj.columns[fltr].tolist() | |||
if isinstance(self._obj, pd.Series): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the modifications in geochem.__init__.py
I wonder if there's a simpler way to do this without having if/else
branches under each function - perhaps by assigning a primary selection index under pyrochem.__init__()
and using it subsequently. There would be more to this for some functions.. but might work well for simple cases. Not yet tested, but something along these lines:
class pyrochem(object):
def __init__(self, obj):
"""Custom dataframe accessor for pyrolite geochemistry."""
self._validate(obj)
self._obj = obj
if isinstance(self._obj, pd.DataFrame):
self._selection_index = self._obj.columns
else: # elif isinstance(self._obj, pd.Series):
self._selection_index = self._obj.index
...
def list_oxides(self):
fltr = self._selection_index.isin(_common_oxides)
return self._selection_index[fltr].tolist()
@property
def oxides(self):
return self._obj[self.list_oxides]
@oxides.setter
def oxides(self, df):
self._obj[self.list_oxides] = df
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point, I changed the PR.
pyrolite/geochem/__init__.py
Outdated
@@ -76,7 +80,16 @@ def list_REE(self): | |||
------- | |||
The returned list will reorder REE based on atomic number. | |||
""" | |||
return [i for i in REE() if i in self._obj.columns] | |||
if isinstance(self._obj, pd.Series): | |||
if "Pm" in self._obj.index: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be simplified slightly to return [i for i in REE(dropPm=("Pm" not in self._obj.index)) if i in self._obj.index]
with the suggestion above around having a selection index, for both series and dataframes it would be return [i for i in REE(dropPm=("Pm" not in self._selection_index)) if i in self._selection_index]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, there is a limit to the amount of filter conditions in a list comprehension before readability gets harmed, but I have no strong opinion here. I have changed it in the PR.
@@ -22,3 +22,4 @@ comments, testing, bug reports, or feature requests. | |||
* `Sarah Shi <https://github.com/sarahshi>`__ | |||
* `Ondrej Lexa <https://github.com/ondrolexa>`__ | |||
* `Bob Myhill <https://github.com/bobmyhill>`__ | |||
* `Malte Mues <https://github.com/mmuesly>`__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding your name here!
Running the lambdas calculation on the series instead of the dataframe allows to embedd it easily into an apply function on the dataframe. In combination with pandarallel, this allows easy parallelisation of the task on larger datasets.
330dceb
to
b4a8a6b
Compare
Thank you for your feedback @morganjwilliams. Sorry for my late response. I got distracted by a paper deadline. I am happy, to see a smaller change set doing the same result. We need a version that works on Series. It took me some time to understand how the DataFrame implementation might work. Feel free to push your changes. I might help looking into adapting the test cases. Comparing the results from the existing code for testing the DataFrame implementation with a version that computes lambdas on each series should be a small modification to add an additional test that covers the important parts of the change. |
@morganjwilliams I pushed my proposal for unit tests. The invocation with apply for the lambda calculation is what is needed to plugin pandarallel later for performance reason. Sadly, I could not fix the docs generation error. I do not understand what the broken code is supposed to do. |
@mmuesly thanks for adding the test suggestions also, I'll have another read through and look to get this in (apologies, I've been traveling and also had some recent big deadlines). Unless there's anything specific around the high-level API, any small tweaks to organization etc I might look to just update on develop subsequently. I'll also chase up the docs-generation related error, it looks like a |
7f18072
into
morganjwilliams:develop
Description
We love the implementation for characterizing tetrades in REE patterns on a dataframe. For performance considerations, we require to wrap the implementation into an apply function on a dataframe. In its current form, it is not possible to compute lambdas only on a single series. I suggest to change the implementation to also support the core computation of lambdas on series objects. As a consequence, it can be used inside an apply function and offers easy parallelization in combination with the pandarallel package.
Moreover, I would love to stop the pyrochem.REE to drop PM if it is already in the dataset. Preparing plots gets more complicated if the already added column is dropped by this column.
Related Issue
#99
Motivation and Context
Apart from solving a performance issue, it also allows to define in a dataset excluded elements for each sample and which samples should be fitted using the LTE effect or not. When integrating lambda fitting into larger pipelines, it is easier to work on a per sample base than splitting the dataset into multiple data frames upfront.
We have many pipelines that use
pyrochem.REE
for selecting the REE pattern before plotting. We love to maintain the PM column in the pattern, if it is already present in the data set.Types of changes
Keeping the PM column if present is a breaking change. Currently, it is always dropped using
pyrochem.REE
Checklist:
pyrochem.REE
. Not sure.)I am happy to make further changes to this PR.