
Deduplication, support select_streams on single pass, support headers-only #74

Open
wants to merge 5 commits into base: main
Conversation

cboulay
Contributor

@cboulay cboulay commented Dec 29, 2020

This PR deduplicates code, allows use of select_streams without a double traversal of the file (#60), and adds the ability to extract headers only.

It was a bit tricky to do while holding onto a couple of top-level functions. I hope I did it to everyone's satisfaction, though it's maybe a little messier than it could have been otherwise.

I think the logger is probably a little too verbose by default, especially when skipping over non-selected streams or grabbing only headers. But I'd like some feedback before I clean that up.

I also think some of the top level functions (even the _ prefixed ones) can be re-ordered more logically, but I wanted to limit the already-large diff.

There are quite a few changes here so I'm tagging lots of reviewers. Please test on your data files. If we're happy with my previous PR then I'll merge that in here so we can test on data files with problematic clock synchronization.
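The single-pass select_streams behavior can be sketched as a filtering generator: non-selected chunks are simply skipped as they are encountered, so the file is only traversed once. This is a minimal illustration with hypothetical chunk dicts, not pyxdf's actual internals.

```python
# Minimal sketch of single-pass stream selection; chunk dicts and the
# 'stream_id' key are illustrative, not pyxdf's exact internal format.

def iter_selected(chunks, select_ids=None):
    """Yield only chunks belonging to the selected stream IDs.

    Skipping happens inline during the one and only traversal,
    instead of requiring a second pass over the file.
    """
    for chunk in chunks:
        if select_ids is None or chunk.get("stream_id") in select_ids:
            yield chunk

chunks = [
    {"stream_id": 1, "tag": 3},
    {"stream_id": 2, "tag": 3},
    {"stream_id": 1, "tag": 6},
]
kept = list(iter_selected(chunks, select_ids={1}))  # stream 2 is skipped
```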

…retain stream-headers-only for resolve_streams.
* main load func uses shared _read_chunks
* support stream_headers_only
* support select_streams on first pass
@cbrnr
Contributor

cbrnr commented Dec 30, 2020

> Am I right in assuming that resolve_streams only needs stream headers? Indeed the functions it calls (parse_xdf, parse_chunks, _read_chunks) only operate on stream headers so the answer seems like an obvious "yes", but those functions' docstrings don't state this explicitly so I was just wondering if you were planning on building these out in the future?

@cboulay yes, to implement select_streams I was lazy and used parts of existing code that I'd written in an effort to write an XDF reader from scratch. Therefore, the functions you mention only need stream headers, but they are also intended to be used for reading data. At that time I thought it was more important to ship the select streams feature than keeping the code base free from duplicated code.

> In my deduplication effort, I'm reusing your _read_chunks method, but this means adding a lot of other functionality to it, only to disable that functionality with stream_headers_only=True when being called via parse_xdf (via resolve_streams). Is this compatible with your use-case?

I need parse_chunks and parse_xdf - so yes, these changes should be compatible. In any case, I will test if everything still works with these changes, just let me know when this is ready for review.

@cboulay
Contributor Author

cboulay commented Dec 30, 2020

@cbrnr , ah, I was operating under the assumption that you only needed headers, so I renamed parse_chunks to parse_stream_header_chunks to be explicit. Sorry about that. We can rename it back.

But, we should discuss a little about whether or not that function is required at all. After the changes in this PR, there remains a difference in stream header formatting between load_xdf and your functions. Namely, load_xdf's stream header dict has all of its values wrapped in a list, even the simple strings (e.g. host) and single numbers (e.g. srate). This is because of the use of defaultdict in _xml2dict. I'm not a big fan of this, but changing this now would break any downstream applications that use pyxdf, so it's there to stay.
Do you absolutely need to have your stream headers "flattened" or "delistified"?

On that same train of thought, if you do build out parse_chunks further, and eventually have it handle data and build concatenated arrays, would you be satisfied if the format of the completed streams was the same as that currently returned by load_xdf? If yes, then maybe we can move the stream-building code out of load_xdf and into parse_chunks; parse_chunks could then be used by load_xdf, with an added kwarg to not flatten the stream header dict values.
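The list-wrapping behavior described above can be illustrated with a small sketch: an _xml2dict-style converter appends every child element's value into a defaultdict(list), so even scalar fields like name or srate come back as one-element lists. The delistify helper below is hypothetical, just to show what "flattening" would mean.

```python
# Sketch of why header values come back list-wrapped. This mimics the
# defaultdict(list) pattern described in the comment; it is not pyxdf's
# actual _xml2dict implementation.
from collections import defaultdict
import xml.etree.ElementTree as ET

def xml2dict(elem):
    d = defaultdict(list)
    for child in elem:
        # Every value is appended, so scalars become one-element lists.
        d[child.tag].append(xml2dict(child) if len(child) else child.text)
    return dict(d)

def delistify(d):
    # Hypothetical helper: unwrap one-element lists for convenience.
    return {k: v[0] if isinstance(v, list) and len(v) == 1 else v
            for k, v in d.items()}

info = ET.fromstring("<info><name>EEG</name><srate>500</srate></info>")
header = xml2dict(info)   # {'name': ['EEG'], 'srate': ['500']}
flat = delistify(header)  # {'name': 'EEG', 'srate': '500'}
```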

Please take a look at what I have in place now and let me know how you would like to proceed.

I think we should leave fixing up the 'verbose' usage until a later issue.

@cbrnr
Contributor

cbrnr commented Dec 30, 2020

@cboulay I will need to dig a little bit into my code to see what I really need. This will take some time, and since I'm technically on vacation right now it would be great if you could give me two weeks or so to check on this. For now, I think that streamlining the code base should take precedence over a specific use case I might have. After all, I can always pull out the functions I need into my other projects. But if possible, let's discuss this next year 😄 (i.e. in ~2 weeks).

@cboulay
Contributor Author

cboulay commented Dec 30, 2020

Of course! We need the headers-only feature sooner rather than later, but we're happy to use the branch for now.
Happy New Year 🥳

@agricolab
Member

agricolab commented Jan 2, 2021

Generally speaking, I am totally OK with rewriting my code to accommodate this PR and think it is worthwhile.

I just have a few questions. One of the tests for my liesl toolbox failed, for a functionality where I am using parse_chunks to inspect xdf files from the CLI. That is easy to accommodate with a try/except that imports parse_stream_header_chunks instead. But then it complains that there is no source_id anymore in the output of the new parse_stream_header_chunks. I can dig a little deeper into the code base, but if you can tell me quickly what the process would be to recover the output of the old parse_chunks with the new functions, that'd save me some time and brain juice. Thanks!

@cboulay
Contributor Author

cboulay commented Jan 2, 2021

@agricolab - Based on what Clemens said, I think I should rename parse_stream_header_chunks back to parse_chunks, so you don't need to worry about that.

Let me explain the organization as I see it before we decide what to do.

Path 1 - using load_xdf:

  1. open_xdf unzips if necessary, reads and checks the magic code
  2. _read_chunks generator function yields a dict per chunk.
    i. Error checking
    ii. Basic processing to put the data into the dict (e.g., _xml2dict, _normalize_header, _read_chunk3, using StreamData to track timestamps).
    iii. kwargs to control select_streams, stream_headers_only
    * select_streams uses match_streaminfos with only a single stream.
  3. Yielded chunks are further processed in load_xdf to build up temp_stream_data.
  4. Optional synchronize clocks
  5. Optional dejitter

Path 2 - using resolve_streams

  1. parse_xdf
  2. open_xdf as above
  3. _read_chunks as above
  4. parse_chunks

Note that we should not change the format returned by Path1 as this would break 3rd party applications we don't control.

Remaining things to change:

  • Move the "extra" processing that happens in load_xdf to a function named parse_chunk
  • Use parse_chunk inside both load_xdf and parse_chunks
  • As much as we can, change the data formats in Path 2 so it's similar to that returned by Path 1.
    • Where they cannot be changed, modify in Path 2.
    • skip_desc can probably be removed because I don't think Clemens actually needs to exclude desc from the info.
  • Somewhere along the way I guess I lost 'source_id', or at least it's not in the same location as previously. Which is correct?

After the above changes, the main difference between Path 1 and Path 2 is that Path 2 loads all the chunks before parsing them, whereas Path 1 parses them as they come in and aggregates them after they've all been parsed. Path 1 also has optional timestamp fixes.

With select_streams and stream_headers_only, there's really not much of a performance difference between the two paths.
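The _read_chunks generator at the heart of both paths can be sketched against the XDF chunk layout (a 1-byte length-of-length field, a little-endian length covering the 2-byte tag plus payload, then the payload). This is a deliberately simplified sketch, assuming that layout; the real pyxdf generator additionally does XML parsing, clock-offset collection, and sample decoding.

```python
# Rough sketch of a _read_chunks-style generator, assuming the XDF
# chunk layout described above. Tags 1 (FileHeader) and 2
# (StreamHeader) are the header chunks; everything else is skipped
# when stream_headers_only is set.
import io
import struct

def read_chunks(f, stream_headers_only=False):
    while True:
        nbytes = f.read(1)
        if not nbytes:
            return  # clean end of file
        (n,) = struct.unpack("B", nbytes)          # length-of-length
        length = int.from_bytes(f.read(n), "little")
        (tag,) = struct.unpack("<H", f.read(2))    # 2-byte chunk tag
        payload = f.read(length - 2)               # length includes the tag
        if stream_headers_only and tag not in (1, 2):
            continue  # the real code would seek past instead of reading
        yield {"tag": tag, "payload": payload}

# Build a tiny in-memory "file" with two chunks: a StreamHeader (tag 2)
# and a Samples chunk (tag 3).
buf = io.BytesIO()
for tag, payload in [(2, b"<info/>"), (3, b"\x00\x01")]:
    body = struct.pack("<H", tag) + payload
    buf.write(struct.pack("B", 1) + struct.pack("B", len(body)) + body)
buf.seek(0)
chunks = list(read_chunks(buf, stream_headers_only=True))
```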

@tstenner
Contributor

tstenner commented Jan 4, 2021

I haven't tested the PR yet, but from what I've read I'm all for it.

I have only two things to add:

  • _read_chunks should yield classes (i.e. an Iterable[Chunk] with Chunk as below) so it's possible to add type info for IDEs and Cython. These could be shared with a not-yet-existing Python XDF writer:
class Chunk:
    tag: int
    header_len: int
    payload_len: int
  • _normalize_header discards all fields not listed and is quite verbose. The default fields that don't need type conversion could be added with a dict comprehension, i.e.
fields = ['name', 'type']
res = {key: header.get(key, [None]) for key in fields}
res['channel_count'] = [int(header['channel_count'][0])]
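Putting the two suggestions together, a runnable version might look like this; the field names are taken from the snippet above and are illustrative rather than pyxdf's definitive set.

```python
# Runnable take on the two suggestions: a typed chunk container plus
# dict-comprehension header normalization. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class Chunk:
    tag: int
    header_len: int
    payload_len: int

def normalize_header(header):
    # Fields that need no type conversion default to [None] if absent.
    fields = ["name", "type"]
    res = {key: header.get(key, [None]) for key in fields}
    # Fields needing conversion are handled explicitly.
    res["channel_count"] = [int(header["channel_count"][0])]
    return res

hdr = normalize_header({"name": ["EEG"], "channel_count": ["8"]})
```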

@cboulay
Contributor Author

cboulay commented Mar 31, 2021

@cbrnr Have you had a chance to look at this? Can you please take a look and make sure that your use-cases work or suggest how to make it work?

@cbrnr
Contributor

cbrnr commented Apr 1, 2021

@cboulay sorry, I completely forgot. I will take a look today.

@cbrnr
Contributor

cbrnr commented Apr 1, 2021

@cboulay can you maybe rename the function back to parse_chunks? Otherwise, I immediately run into an error because I need this function. After that, testing will be a bit easier for me.

@cboulay
Contributor Author

cboulay commented Apr 1, 2021

I’m unavailable today. Please go ahead and make that change on your end and I’ll resolve later.

@cbrnr
Contributor

cbrnr commented Apr 1, 2021

Alright, all of my files still work with this PR. In addition to the comments already mentioned, I suggest that we make open_xdf a private function (i.e. rename it to _open_xdf). Furthermore, the public API is currently a bit confusing with the following functions (I'm already excluding open_xdf):

  • load_xdf: the main function which loads an XDF file
  • match_streaminfos: find stream IDs matching specified criteria
  • resolve_streams: resolve streams in given XDF file, i.e. return information on all streams
  • parse_xdf
  • parse_chunks

I'm using parse_chunks(parse_xdf(fname)) only to extract stream infos, so we could maybe simply expose one function that returns the composition of the two. Not sure if anyone else is using one or both of these functions separately though. I'm also wondering if this could be completely replaced by resolve_streams - I'll have to take a look.
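The composition described above could be exposed as a single public helper; this is only a sketch of the idea, with stand-in placeholder functions rather than the real pyxdf ones, and the name stream_infos is hypothetical.

```python
# Sketch of wrapping parse_chunks(parse_xdf(fname)) behind one public
# function. Both helpers below are placeholders standing in for the
# real pyxdf implementations.
def parse_xdf(fname):
    # Placeholder: would open the file and return all of its chunks.
    return [{"stream_id": 1, "name": ["EEG"]}]

def parse_chunks(chunks):
    # Placeholder: would extract per-stream info dicts from the chunks.
    return [{"stream_id": c["stream_id"]} for c in chunks]

def stream_infos(fname):
    """Single public entry point combining the two steps."""
    return parse_chunks(parse_xdf(fname))
```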

@cboulay could you compile a new to do list summarizing all the points mentioned in this thread? It doesn't mean that you have to implement all of the suggestions, but I'd like to discuss all points with everyone.

Contributor

@chkothe chkothe left a comment


Looks good to me!

@cbrnr
Contributor

cbrnr commented May 4, 2021

BTW, test_load_from_memory fails with a KeyError because testfile does not have an "info" field. This test works on master.
