Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ExtractorProcessor.__call__() does not retain the order given by the members filter #457

Open
carlocastoldi opened this issue Jan 10, 2025 · 1 comment
Labels
bug Report a problem that needs to be fixed

Comments

@carlocastoldi
Copy link

Description of the problem:
When I use an ExtractorProcessor (e.g. Unzip), I would expect that if I specify a list of files i want to extract, the order of the extracted paths was retained.
Instead, ExtractorProcessor uses os.walk(self.extract_dir)

for path, _, files in os.walk(self.extract_dir):
and then filters the extracted files based on the given members, resulting in an arbitrary order.

I would like to suggest to invert the check: loop over the members (if any), and filter on the extracted files.

Also: why not use pathlib.Path?

Example

>>> pooch.Unzip(
                extract_dir=".",
                members=[
                        'E11.5/E11-5_DevCCF_Annotations_31.5um.nii.gz',
                        'E11.5/E11-5_MRI-adc_31.5um.nii.gz',
                ])

Expected output

'/home/castoldi/brainglobe_workingdir/kim_dev_mouse/downloads/./E11.5/E11-5_DevCCF_Annotations_31.5um.nii.gz']
'/home/castoldi/brainglobe_workingdir/kim_dev_mouse/downloads/./E11.5/E11-5_MRI-adc_31.5um.nii.gz',

Actual output

'/home/castoldi/brainglobe_workingdir/kim_dev_mouse/downloads/./E11.5/E11-5_MRI-adc_31.5um.nii.gz',
'/home/castoldi/brainglobe_workingdir/kim_dev_mouse/downloads/./E11.5/E11-5_DevCCF_Annotations_31.5um.nii.gz']

System information

  • Operating system: Ubuntu 24.04
  • Python installation (Anaconda, system, ETS): system
  • Version of Python: 3.12.3
  • Version of this package: 1.8.2
@carlocastoldi carlocastoldi added the bug Report a problem that needs to be fixed label Jan 10, 2025
@santisoler
Copy link
Member

Hi @carlocastoldi. Thanks for opening this issue. I agree that it's annoying to get a list of extracted files in a different order than the one specified through the members argument. The order of the resulting list of files is not specified, but I agree that it's not intuitive to receive the files in a different order.

I think we could open a PR to fix this, so the resulting list keeps the order of files in the members argument.

I would like to suggest to invert the check: loop over the members (if any), and filter on the extracted files.

We should assume that the archive (.zip, .tar, etc) might contain a lot of files (let's say N files), while the members list will contain a subset of them (M = len(members), M <= N). For this reason, we would like to walk the tree of files in the archive file only once. With your idea of looping over the members and then filter each one of them from the tree, we would need to explore the tree M times. We would impact the performance: now it takes O(N) to filter all files, and by inverting the loops it would take O(M*N).

One alternative would be to keep the loops as they are now, and push a filename to a dictionary that matches the members with their corresponding extracted files. By the end of the method, we could build the fnames list by obtaining the respective extracted files for each one of the members, in the correct order. Something like:

extracted = {}
for path, _, files in os.walk(self.extract_dir):
    for filename in files:
        ...
        extracted[member] = extracted_file

fnames = [extracted[member] for member in self.members]
return fnames

The overhead of creating the dictionary is minimal, so with this approach we don't hit the performance of the method.

If we move forward with this, I think we should take care also the case when members=None (we should return all extracted files, so we don't care too much about the order).

One other thing I noticed is that the current implementation ignores files that are in the members but do not exist in the archive. I think this is another bug: we should error out or warn the user that one or more files in the members do not exist in the archive instead of silently ignoring it.

Let me know what do you think!

Also: why not use pathlib.Path?

Historical reasons, plus lack of time to update all the codebase. I'd be glad to ditch all those os.path.join and similar calls in Pooch's codebase!


Would you like to work on this? If so, feel free to open a PR to lay down some ideas.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Report a problem that needs to be fixed
Projects
None yet
Development

No branches or pull requests

2 participants