Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should .files pretend virtual overview files don't exist? #286

Open
takluyver opened this issue Mar 8, 2022 · 7 comments
Open

Should .files pretend virtual overview files don't exist? #286

takluyver opened this issue Mar 8, 2022 · 7 comments

Comments

@takluyver
Copy link
Member

The .files attribute (on DataCollection, SourceData, KeyData) is not documented, but also not _private by convention. This puts it in a grey area as API.

It still technically works with virtual overview files - it will point to one virtual overview file (per run) rather than a list of actual source files. But for some purposes, people might want to see a list of the relevant source files.

It's relatively easy to get a list of all the source files for a run - either from the virtual overview file itself, or by scanning the run folder. We could make .files a property that does this on demand. But it gets trickier to filter it following source & train selections if EXtra-data is not actually using the files. Still possible if we have to, but trickier.

A workaround is that people can pass _use_voview=False when opening a run to bypass the virtual overview file and work with the regular source files like before.

@philsmt
Copy link
Contributor

philsmt commented Mar 9, 2022

Does _use_voview=False continue to use the legacy data map mechanism? Are they also written when opening with _use_voview=True the first time?

@takluyver
Copy link
Member Author

With _use_voview=False, the JSON 'run data map' files are created and used as before. If it can use a virtual overview file, the JSON map is neither used nor created.

if _use_voview and (sel_files == files):
voview_file_acc = voview.find_file_valid(path)
if voview_file_acc is not None:
return DataCollection([voview_file_acc])
files_map = RunFilesMap(path)
t0 = time.monotonic()
d = DataCollection.from_paths(
files, files_map, inc_suspect_trains=inc_suspect_trains,
is_single_run=True, parallelize=parallelize
)
log.debug("Opened run with %d files in %.2g s",
len(d.files), time.monotonic() - t0)
files_map.save(d.files)

@philsmt
Copy link
Contributor

philsmt commented Mar 9, 2022

One cautious option might be to rename the current property to _files, and throw an exception for files if opened with virtual overview files. That at least protects against code using it from failing silently, and we might decide to simulate the previous behaviour as you described in a second step.

@takluyver
Copy link
Member Author

True, but depending on what code wants .files for, it may also break things that currently work. The list of one FileAccess object is valid.

@philsmt
Copy link
Contributor

philsmt commented Mar 9, 2022

That is true, but my worry is silently breaking code exactly because the list is still valid, but not what was expected. Preventing that may well be worth it visibly breaking code that may have worked anyways, but which can be fixed as easily.

@takluyver
Copy link
Member Author

But the easiest fix for code where that's not an issue would be to use ._files instead, which trains people to use private attributes and makes future changes more difficult. Or they disable using virtual overview files, and then don't get the benefits.

@tmichela
Copy link
Member

I actually bumped into it today and I used _use_voview=False as a workaround.

we could maybe populate the list with original files when using overview files (the path information to original files is stored there) and lazily generate FileAccess whenever trying to access DataCollection.files. But honestly I'm not sure it's worth the effort...

Are you aware of any practical use of .files?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants