Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring the way data is returned in panedr #33

Merged
merged 14 commits into from
Jun 29, 2022
29 changes: 23 additions & 6 deletions panedr/panedr.py
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,7 @@
import sys
import itertools
import time
import pandas


#Index for the IDs of additional blocks in the energy file.
#Blocks can be added without sacrificing backward and forward
Expand Down Expand Up @@ -75,7 +75,7 @@
Enxnm = collections.namedtuple('Enxnm', 'name unit')
ENX_VERSION = 5

__all__ = ['edr_to_df']
__all__ = ['edr_to_df', 'edr_to_dict']
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a reason not to add read_edr here as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw read_edr as merely providing data for the two user-exposed functions edr_to_df and edr_to_dict, so that the user should never need to call read_edr itself directly. I am not sure if the return values of this function are of use to a user, but yeah, that's not really a reason to add it here.



class EDRFile(object):
Expand Down Expand Up @@ -395,21 +395,21 @@ def edr_strings(data, file_version, n):

def is_frame_magic(data):
"""Unpacks an int and checks whether it matches the EDR frame magic number

Does not roll the reading position back.
"""
magic = data.unpack_int()
return magic == -7777777


def edr_to_df(path, verbose=False):
def read_edr(path, verbose_set=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the change from verbose?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I felt weird about verbose=verbose, but yeah, I guess there is not really a reason not to put this 😅

begin = time.time()
edr_file = EDRFile(str(path))
all_energies = []
all_names = [u'Time'] + [nm.name for nm in edr_file.nms]
times = []
for ifr, frame in enumerate(edr_file):
if verbose:
if verbose_set:
if ((ifr < 20 or ifr % 10 == 0) and
(ifr < 200 or ifr % 100 == 0) and
(ifr < 2000 or ifr % 1000 == 0)):
Expand All @@ -421,11 +421,28 @@ def edr_to_df(path, verbose=False):
all_energies.append([frame.t] + [ener.e for ener in frame.ener])

end = time.time()
if verbose:
if verbose_set:
print('\rLast Frame read : {}, time : {} ps'
.format(ifr, frame.t),
end='', file=sys.stderr)
print('\n{} frame read in {:.2f} seconds'.format(ifr, end - begin),
file=sys.stderr)

return all_energies, all_names, times


def edr_to_df(path: str, verbose: bool = False):
import pandas
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is pandas now an optional dependency in requirements.txt etc? If so I would guard this with a try: except similar to

        try:
            import pandas
        except ImportError:
            raise ImportError("""ERROR --- pandas was not found!
                pandas is required to use the `.edr_to_df()` functionality.
                try installing it using pip eg:
                    pip install pandas """)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'll make a note to add a test for raising this error as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure what the best way to make pandas optional is. I have now done this by removing pandas from requirements.txt and adding a section under [extras] in setup.cfg. panedr can now be installed with pandas by running

pip install -e .[pandas]

Please let me know if this is not the best way to do these things.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather do things a little bit differently. The way you do it here will break the user experience for the standalone case. A user who do not use mdanalysis will install panedr and won't have pandas to use the main function out of the box.

Instead, I would create 2 packages: the default one that depends on pandas and a "lite" one for downstream integrators who want to minimise dependencies.

I don't know how to do that, though...

all_energies, all_names, times = read_edr(path, verbose_set=verbose)
df = pandas.DataFrame(all_energies, columns=all_names, index=times)
return df


def edr_to_dict(path: str, verbose: bool = False):
import numpy as np
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an optional dependency? I think its probably safe to make numpy compulsory for panedr. If we do make it compulsory you can import this at the top of the file otherwise use a guard like the one for pandas mentioned in my other comment.

Thoughts @jbarnoud?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is reasonable to make numpy a compulsory dependency. The current users have to install it already because of pandas anyway; the new users will likely use it as well.

all_energies, all_names, times = read_edr(path, verbose_set=verbose)
energy_dict = {}
for idx, name in enumerate(all_names):
energy_dict[name] = np.array(
[all_energies[frame][idx] for frame in range(len(times))])
return energy_dict
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure that the "Time" key is in. I expect it to be, but I do not remember exactly how I treated it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Time" is part of all_names and all_energies as returned by read_edr. This is done specifically by

all_names = [u'Time'] + [nm.name for nm in edr_file.nms]
[...]
all_energies.append([frame.t] + [ener.e for ener in frame.ener])

8 changes: 8 additions & 0 deletions tests/test_edr.py
Original file line number Diff line number Diff line change
Expand Up @@ -163,6 +163,14 @@ def _assert_progress_range(self, progress, dt, start, stop, step):
assert ref_line == progress_line


def test_edr_to_dict():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only really tests that it returns the same as edr_to_df so perhaps a more descriptive name like edr_to_dict_matches_edr_to_df. I know its a bit verbose but clarity is always good.

array_dict = panedr.edr_to_dict(EDR)
ref_df = panedr.edr_to_df(EDR)
array_df = pandas.DataFrame.from_dict(array_dict).set_index(
"Time", drop=False)
assert array_df.equals(ref_df)


def read_xvg(path):
"""
Reads XVG file, returning the data, names, and precision.
Expand Down