Refactoring the way data is returned in panedr #33

BFedder · 2022-06-14T14:02:48Z

Addressing #25

I have changed the way panedr returns the energy data that it reads from EDR files.

Old behaviour

edr_to_df() opens the file, parses its binary contents, populates lists with the energy data, and assembles and returns a Pandas DataFrame.

New behaviour

The parsing of the EDR file is now done by a new read_edr() function.
read_edr() returns the lists that were previously generated by edr_to_df() as an intermediate step.
The data can now be returned as a dictionary of NumPy arrays with edr_to_dict(). Alternatively, returning the data as a Pandas DataFrame is possible via edr_to_df(). Both of these functions call read_edr().

The old behaviour is maintained: users can still call edr_to_df() and obtain the results they expect.
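
The split described above can be sketched as follows. This is a minimal illustration, not panedr's actual code: `read_edr_stub` is a hypothetical stand-in that returns hard-coded values instead of parsing a file, plain lists stand in for NumPy arrays, and a list of row dictionaries stands in for the DataFrame, so the sketch runs without any dependency. The return shape of the stub (rows, names, times) is an assumption based on the review discussion further down.

```python
def read_edr_stub(path):
    """Hypothetical stand-in for read_edr(): returns frame-major data.

    The values are hard-coded for illustration; no file is read.
    """
    all_names = ["Time", "Potential", "Temperature"]
    all_energies = [
        [0.0, -1200.5, 298.1],
        [1.0, -1201.2, 299.4],
        [2.0, -1199.8, 300.2],
    ]
    times = [row[0] for row in all_energies]
    return all_energies, all_names, times


def to_dict(path):
    """Column-oriented view: one sequence per energy term."""
    all_energies, all_names, times = read_edr_stub(path)
    return {
        name: [all_energies[frame][idx] for frame in range(len(times))]
        for idx, name in enumerate(all_names)
    }


def to_rows(path):
    """Row-oriented view (stand-in for the DataFrame path)."""
    all_energies, all_names, _ = read_edr_stub(path)
    return [dict(zip(all_names, row)) for row in all_energies]


# Both wrappers share the single parsing step.
energies = to_dict("ener.edr")
rows = to_rows("ener.edr")
```

Keeping the parsing in one function means that a downstream package can consume the dictionary without ever importing pandas, while existing callers of edr_to_df() see no change.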

To Do

Write tests for new functions
Write detailed documentation for new functions / panedr in general (Add docs? (if needed) #30)
Include more type hinting (Type hinting? #31)

pep8speaks · 2022-06-14T14:02:51Z

Hello @BFedder! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-06-29 11:48:13 UTC

BFedder · 2022-06-14T16:39:09Z

I have now added a test that compares the dictionary of arrays that edr_to_dict() returns to edr_to_df()'s DataFrame. It converts the dictionary into a dataframe and asserts that this dataframe is equal to one generated with edr_to_df(). This works, but relies on pandas and on the other tests, so we might want to change this still.

hmacdope

Good start. Have a look at the comments for some changes.

hmacdope · 2022-06-17T07:25:34Z

panedr/panedr.py

    Does not roll the reading position back.
    """
    magic = data.unpack_int()
    return magic == -7777777


-def edr_to_df(path, verbose=False):
+def read_edr(path, verbose_set=False):


Why the change from verbose?

I felt weird about verbose=verbose, but yeah, I guess there is not really a reason not to put this 😅

hmacdope · 2022-06-17T07:29:18Z

panedr/panedr.py

+
+
+def edr_to_df(path: str, verbose: bool = False):
+    import pandas


Is pandas now an optional dependency in requirements.txt etc? If so I would guard this with a try: except similar to

    try:
        import pandas
    except ImportError:
        raise ImportError("""ERROR --- pandas was not found!
            pandas is required to use the `.edr_to_df()` functionality.
            try installing it using pip eg:
            pip install pandas """)

Good point. I'll make a note to add a test for raising this error as well.

I wasn't sure what the best way to make pandas optional is. I have now done this by removing pandas from requirements.txt and adding a section under [extras] in setup.cfg. panedr can now be installed with pandas by running

pip install -e .[pandas]

Please let me know if this is not the best way to do these things.

I would rather do things a little bit differently. The way you do it here will break the user experience for the standalone case. A user who does not use MDAnalysis will install panedr and won't have pandas, so the main function will not work out of the box.

Instead, I would create 2 packages: the default one that depends on pandas and a "lite" one for downstream integrators who want to minimise dependencies.

I don't know how to do that, though...

hmacdope · 2022-06-17T07:31:12Z

panedr/panedr.py

+
+
+def edr_to_dict(path: str, verbose: bool = False):
+    import numpy as np


Is this an optional dependency? I think it's probably safe to make numpy compulsory for panedr. If we do make it compulsory, you can import this at the top of the file; otherwise, use a guard like the one for pandas mentioned in my other comment.

Thoughts @jbarnoud?

I think it is reasonable to make numpy a compulsory dependency. The current users have to install it already because of pandas anyway; the new users will likely use it as well.

hmacdope · 2022-06-17T07:33:08Z

tests/test_edr.py

@@ -163,6 +163,14 @@ def _assert_progress_range(self, progress, dt, start, stop, step):
            assert ref_line == progress_line


+def test_edr_to_dict():


This only really tests that it returns the same as edr_to_df, so perhaps use a more descriptive name like edr_to_dict_matches_edr_to_df. I know it's a bit verbose, but clarity is always good.

jbarnoud

I see no issue with the tests depending on pandas. Optional tests are usually a bad idea because they allow issues to fall through the cracks. It is justifiable in some cases, though, like when a dependency is very specialised and difficult to install.

jbarnoud · 2022-06-18T10:16:30Z

panedr/panedr.py

@@ -75,7 +75,7 @@
 Enxnm = collections.namedtuple('Enxnm', 'name unit')
 ENX_VERSION = 5

-__all__ = ['edr_to_df']
+__all__ = ['edr_to_df', 'edr_to_dict']


I don't see a reason not to add read_edr here as well.

I saw read_edr as merely providing data for the two user-exposed functions edr_to_df and edr_to_dict, so that the user should never need to call read_edr itself directly. I am not sure if the return values of this function are of use to a user, but yeah, that's not really a reason to add it here.

jbarnoud · 2022-06-18T10:20:17Z

panedr/panedr.py

+    for idx, name in enumerate(all_names):
+        energy_dict[name] = np.array(
+            [all_energies[frame][idx] for frame in range(len(times))])
+    return energy_dict


Make sure that the "Time" key is in. I expect it to be, but I do not remember exactly how I treated it.

"Time" is part of all_names and all_energies as returned by read_edr. This is done specifically by

    all_names = [u'Time'] + [nm.name for nm in edr_file.nms]
    [...]
    all_energies.append([frame.t] + [ener.e for ener in frame.ener])
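
To make the reply above concrete, here is a small sketch of why "Time" ends up as a key. `Energy` and `Frame` are hypothetical stand-ins for the parsed frame objects, and the numbers are made up; only the `Enxnm` namedtuple, the attribute names `t`, `ener`, and `e`, and the two list-building expressions come from the code quoted in this thread.

```python
import collections

# Enxnm matches the namedtuple defined in panedr.py; Energy and Frame
# are hypothetical stand-ins for the parsed frame objects.
Enxnm = collections.namedtuple("Enxnm", "name unit")
Energy = collections.namedtuple("Energy", "e")
Frame = collections.namedtuple("Frame", "t ener")

nms = [Enxnm("Potential", "kJ/mol"), Enxnm("Temperature", "K")]
frames = [
    Frame(t=0.0, ener=[Energy(-1200.5), Energy(298.1)]),
    Frame(t=1.0, ener=[Energy(-1201.2), Energy(299.4)]),
]

# "Time" is prepended to the names, and frame.t to every row, so the
# time column always sits at index 0 of both structures.
all_names = ["Time"] + [nm.name for nm in nms]
all_energies = []
for frame in frames:
    all_energies.append([frame.t] + [ener.e for ener in frame.ener])

# Transpose the frame-major rows into one column per name.
energy_dict = {
    name: [row[idx] for row in all_energies]
    for idx, name in enumerate(all_names)
}
```

Because the name list and every row are extended in the same position, any code that iterates over `all_names` picks up the time column without special handling.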

codecov · 2022-06-20T16:48:52Z

Codecov Report

Merging #33 (2659211) into master (81289f1) will increase coverage by 0.85%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #33      +/-   ##
==========================================
+ Coverage   81.99%   82.84%   +0.85%     
==========================================
  Files           2        2              
  Lines         261      274      +13     
==========================================
+ Hits          214      227      +13     
  Misses         47       47

Impacted Files	Coverage Δ
panedr/panedr.py	`82.59% <100.00%> (+0.88%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 81289f1...2659211.

jbarnoud · 2022-06-27T11:30:02Z

panedr/panedr.py

+    try:
+        import pandas
+    except ImportError:
+        raise ImportError("""ERROR --- pandas was not found!


There should be a test for that, which means a pipeline that does not install pandas and a test that runs the function and asserts the exception is raised.

@BFedder you should be able to just use monkeypatch here to do this: https://github.com/MDAnalysis/mdanalysis/blob/3e249fe7173e68bb61d4ad7a2dfd316014760410/testsuite/MDAnalysisTests/utils/test_datafiles.py#L28-L38

I have added this test now
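
One way to exercise the guarded import without a dedicated pandas-free pipeline is to hide the module for the duration of a test. The sketch below uses only the standard library: a `None` entry in `sys.modules` makes a subsequent `import` of that name raise `ImportError`. `edr_to_df_stub` and `call_without_pandas` are hypothetical names, the error message is shortened, and this is not necessarily how the linked MDAnalysis test or the test added in this PR is written.

```python
import sys
from unittest import mock


def edr_to_df_stub(path):
    """Hypothetical stand-in for a function that needs pandas."""
    try:
        import pandas  # noqa: F401
    except ImportError:
        raise ImportError(
            "pandas was not found! pandas is required to use "
            "edr_to_df(); try: pip install pandas"
        )
    return path


def call_without_pandas():
    """Return the error message raised when pandas is hidden."""
    # A None entry in sys.modules makes `import pandas` fail, whether or
    # not pandas is actually installed; patch.dict restores the original
    # state on exit.
    with mock.patch.dict(sys.modules, {"pandas": None}):
        try:
            edr_to_df_stub("ener.edr")
        except ImportError as err:
            return str(err)
    return None


message = call_without_pandas()
```

Under pytest, the same idea can be expressed with `monkeypatch.setitem(sys.modules, "pandas", None)` inside a `pytest.raises(ImportError)` block.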

jbarnoud · 2022-06-27T11:31:21Z

requirements.txt

@@ -1,2 +1,2 @@
-pandas
+numpy>=1.19.0


@IAlibay is bumping the version to 1.20.0 for MDAnalysis in MDAnalysis/mdanalysis#3737

It doesn't impact things here, so I wouldn't think it matters. If you want to raise numpy to 1.20, you'll have to drop py3.6.

What should we prioritise here? Staying in-step with MDAnalysis' dependencies or keeping Py3.6?

It doesn't impact MDAnalysis if panedr supports a wider range of python & numpy versions, so IMHO it's fine to just leave it as-is.

BFedder · 2022-06-27T14:41:20Z

I'm thinking if there are no further comments here, it might make sense to merge this PR as is so that #42 can be merged soon as well. I will then add docstrings/type hints in a separate PR. What do you think @jbarnoud @hmacdope @IAlibay?
As far as I can see, the requested changes regarding the refactoring are all addressed now, and the requirement handling re: pandas is part of #42.

IAlibay · 2022-06-27T16:03:41Z

I'm thinking if there are no further comments here, it might make sense to merge this PR as is so that #42 can be merged soon as well. I will then add docstrings/type hints in a separate PR. What do you think @jbarnoud @hmacdope @IAlibay? As far as I can see, the requested changes regarding the refactoring are all addressed now, and the requirement handling re: pandas is part of #42.

Works with me.

tests/test_edr.py

IAlibay

Just the one thing on my end - I'm happy to move "further cleaning up" (docstring, type hints) to a future PR if this means we can keep things progressing.

@hmacdope can you check if everything you requested has been addressed?

tests/test_edr.py

hmacdope

All looks good to me @IAlibay. Great work @BFedder keep it up!

Co-authored-by: Irfan Alibay

restructured functions returning energy data

c484bef

BFedder and others added 3 commits June 14, 2022 15:04

make pep8speaks happy

48085f3

rewrote verbose check

e700706

added test for edr_to_dict()

18e68c9

hmacdope requested changes Jun 17, 2022


jbarnoud reviewed Jun 18, 2022


BFedder and others added 2 commits June 19, 2022 18:41

Merge branch 'MDAnalysis:master' into return_dict

967807e

first batch of reviews addressed

d69ae9e

BFedder mentioned this pull request Jun 20, 2022

Handling pandas dependency #34

Closed

Merge branch 'master' into return_dict

09b8e9e

Merge branch 'master' into return_dict

3b708d1

orbeckst mentioned this pull request Jun 22, 2022

Evaluating coverage #37

Closed

hmacdope mentioned this pull request Jun 23, 2022

Add numpy to requirements.txt #39

Closed

BFedder and others added 2 commits June 25, 2022 16:43

changed required numpy version to match MDAnalysis

20d5e39

Merge branch 'MDAnalysis:master' into return_dict

52857ad

jbarnoud reviewed Jun 27, 2022


IAlibay mentioned this pull request Jun 27, 2022

First attempt: panedr and panedrlite #42

Merged

added test for pandas ImportError

ce812c5

BFedder changed the title ~~[WIP] Refactoring the way data is returned in panedr~~ Refactoring the way data is returned in panedr Jun 27, 2022

jbarnoud reviewed Jun 27, 2022


BFedder and others added 2 commits June 27, 2022 17:54

Update test_edr.py

692ccad

Fixing

6ef2f84

IAlibay requested changes Jun 29, 2022


hmacdope approved these changes Jun 29, 2022


Update tests/test_edr.py

2659211

Co-authored-by: Irfan Alibay

hmacdope merged commit 84bd117 into MDAnalysis:master Jun 29, 2022

BFedder added a commit to BFedder/panedr that referenced this pull request Jun 29, 2022

merge MDAnalysis#33

9632971
