Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring the way data is returned in panedr #33

Merged
merged 14 commits into from
Jun 29, 2022

Conversation

BFedder
Copy link
Collaborator

@BFedder BFedder commented Jun 14, 2022

Addressing #25

I have changed the way data than panedr returns the energy data that it reads from EDR files.

Old behaviour

edr_to_df() opens the file, parses its binary contents, populates lists with the energy data, and assembles and returns a Pandas DataFrame.

New behaviour

The parsing of the EDR file is now done by a new read_edr() function.
read_edr() returns the lists that were previously generated by edr_to_df() as an intermediate step.
The data can now be returned as a dictionary of NumPy arrays with edr_to_dict(). Alternatively, returning the data as a Pandas DataFrame is possible via edr_to_df(). Both of these functions call read_edr().

The old behaviour is maintained, users can still call edr_to_df() and obtain the results they expect.

To Do

@pep8speaks
Copy link

pep8speaks commented Jun 14, 2022

Hello @BFedder! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-06-29 11:48:13 UTC

@BFedder
Copy link
Collaborator Author

BFedder commented Jun 14, 2022

I have now added a test that compares the dictionary of arrays that edr_to_dict() returns to edr_to_df()'s DataFrame. It converts the dictionary into a dataframe and asserts that this dataframe is equal to one generated with edr_to_df(). This works, but relies on pandas and on the other tests, so we might want to change this still.

Copy link
Member

@hmacdope hmacdope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start. have a look at comment for some changes.

panedr/panedr.py Outdated
Does not roll the reading position back.
"""
magic = data.unpack_int()
return magic == -7777777


def edr_to_df(path, verbose=False):
def read_edr(path, verbose_set=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the change from verbose?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I felt weird about verbose=verbose, but yeah, I guess there is not really a reason not to put this 😅

panedr/panedr.py Outdated


def edr_to_df(path: str, verbose: bool = False):
import pandas
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is pandas now an optional dependency in requirements.txt etc? If so I would guard this with a try: except similar to

        try:
            import pandas
        except ImportError:
            raise ImportError("""ERROR --- pandas was not found!
                pandas is required to use the `.edr_to_df()` functionality.
                try installing it using pip eg:
                    pip install pandas """)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'll make a note to add a test for raising this error as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure what the best way to make pandas optional is. I have now done this by removing pandas from requirements.txt and adding a section under [extras] in setup.cfg. panedr can now be installed with pandas by running

pip install -e .[pandas]

Please let me know if this is not the best way to do these things.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather do things a little bit differently. The way you do it here will break the user experience for the standalone case. A user who do not use mdanalysis will install panedr and won't have pandas to use the main function out of the box.

Instead, I would create 2 packages: the default one that depends on pandas and a "lite" one for downstream integrators who want to minimise dependencies.

I don't know how to do that, though...

panedr/panedr.py Outdated


def edr_to_dict(path: str, verbose: bool = False):
import numpy as np
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an optional dependency? I think its probably safe to make numpy compulsory for panedr. If we do make it compulsory you can import this at the top of the file otherwise use a guard like the one for pandas mentioned in my other comment.

Thoughts @jbarnoud?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is reasonable to make numpy a compulsory dependency. The current users have to install it already because of pandas anyway; the new users will likely use it as well.

@@ -163,6 +163,14 @@ def _assert_progress_range(self, progress, dt, start, stop, step):
assert ref_line == progress_line


def test_edr_to_dict():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only really tests that it returns the same as edr_to_df so perhaps a more descriptive name like edr_to_dict_matches_edr_to_df. I know its a bit verbose but clarity is always good.

Copy link
Collaborator

@jbarnoud jbarnoud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see no issue with the tests depending on pandas. Optional tests are usually a bad idea because they allow issues to fall through the cracks. It is justifiable in some cases, though, like when a dependency is very specialised and difficult to install.

panedr/panedr.py Outdated
@@ -75,7 +75,7 @@
Enxnm = collections.namedtuple('Enxnm', 'name unit')
ENX_VERSION = 5

__all__ = ['edr_to_df']
__all__ = ['edr_to_df', 'edr_to_dict']
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a reason not to add read_edr here as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw read_edr as merely providing data for the two user-exposed functions edr_to_df and edr_to_dict, so that the user should never need to call read_edr itself directly. I am not sure if the return values of this function are of use to a user, but yeah, that's not really a reason to add it here.

for idx, name in enumerate(all_names):
energy_dict[name] = np.array(
[all_energies[frame][idx] for frame in range(len(times))])
return energy_dict
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure that the "Time" key is in. I expect it to be, but I do not remember exactly how I treated it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Time" is part of all_names and all_energies as returned by read_edr. This is done specifically by

all_names = [u'Time'] + [nm.name for nm in edr_file.nms]
[...]
all_energies.append([frame.t] + [ener.e for ener in frame.ener])

@codecov
Copy link

codecov bot commented Jun 20, 2022

Codecov Report

Merging #33 (2659211) into master (81289f1) will increase coverage by 0.85%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #33      +/-   ##
==========================================
+ Coverage   81.99%   82.84%   +0.85%     
==========================================
  Files           2        2              
  Lines         261      274      +13     
==========================================
+ Hits          214      227      +13     
  Misses         47       47              
Impacted Files Coverage Δ
panedr/panedr.py 82.59% <100.00%> (+0.88%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 81289f1...2659211. Read the comment docs.

try:
import pandas
except ImportError:
raise ImportError("""ERROR --- pandas was not found!
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a test for that, which means a pipeline that does not install pandas and a test that runs the function and asserts the exception is raised.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added this test now

@@ -1,2 +1,2 @@
pandas
numpy>=1.19.0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@IAlibay is bumping the version to 1.20.0 for MDAnalysis in MDAnalysis/mdanalysis#3737

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't impact things here so it wouldn't think it matters. If you want to raise numpy to 1.20 you'll have to drop py3.6.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should we prioritise here? Staying in-step with MDAnalysis' dependencies or keeping Py3.6?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't impact MDAnalysis if panedr supports a wider range of python & numpy versions, so IMHO it's fine to just leave it as-is.

@BFedder
Copy link
Collaborator Author

BFedder commented Jun 27, 2022

I'm thinking if there are no further comments here, it might make sense to merge this PR as is so that #42 can be merged soon as well. I will then add docstrings/type hints in a separate PR. What do you think @jbarnoud @hmacdope @IAlibay?
As far as I can see, the requested changes regarding the refactoring are all addressed now, and the requirement handling re: pandas is part of #42.

@BFedder BFedder changed the title [WIP] Refactoring the way data is returned in panedr Refactoring the way data is returned in panedr Jun 27, 2022
@IAlibay
Copy link
Member

IAlibay commented Jun 27, 2022

I'm thinking if there are no further comments here, it might make sense to merge this PR as is so that #42 can be merged soon as well. I will then add docstrings/type hints in a separate PR. What do you think @jbarnoud @hmacdope @IAlibay? As far as I can see, the requested changes regarding the refactoring are all addressed now, and the requirement handling re: pandas is part of #42.

Works with me.

tests/test_edr.py Outdated Show resolved Hide resolved
Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the one thing on my end - I'm happy to move "further cleaning up" (docstring, type hints) to a future PR if this means we can keep things progressing.

@hmacdope can you check if everything you requested has been addressed?

tests/test_edr.py Outdated Show resolved Hide resolved
Copy link
Member

@hmacdope hmacdope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks good to me @IAlibay. Great work @BFedder keep it up!

Co-authored-by: Irfan Alibay <[email protected]>
@hmacdope hmacdope merged commit 84bd117 into MDAnalysis:master Jun 29, 2022
BFedder added a commit to BFedder/panedr that referenced this pull request Jun 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants