
Deduplication, support select_streams on single pass, support headers-only #74

Open
wants to merge 5 commits into base: main
Conversation

cboulay
Contributor

@cboulay cboulay commented Dec 29, 2020

This PR deduplicates code, allows use of select_streams without a double traversal of the file (#60), and adds the ability to extract headers only.

It was a bit tricky to do while holding onto a couple of top-level functions. I hope I did it to everyone's satisfaction, though it's maybe a little messier than it could have been otherwise.

I think the logger is probably a little too verbose by default, especially when skipping over non-selected streams or grabbing only headers. But I'd like some feedback before I clean that up.

I also think some of the top level functions (even the _ prefixed ones) can be re-ordered more logically, but I wanted to limit the already-large diff.

There are quite a few changes here so I'm tagging lots of reviewers. Please test on your data files. If we're happy with my previous PR then I'll merge that in here so we can test on data files with problematic clock synchronization.
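The single-pass select_streams behavior can be sketched as a filtering generator: non-selected chunks are simply skipped as they are encountered, so the file is only traversed once. This is a minimal illustration with hypothetical chunk dicts, not pyxdf's actual internals.

```python
# Minimal sketch of single-pass stream selection; chunk dicts and the
# 'stream_id' key are illustrative, not pyxdf's exact internal format.

def iter_selected(chunks, select_ids=None):
    """Yield only chunks belonging to the selected stream IDs.

    Skipping happens inline during the one and only traversal,
    instead of requiring a second pass over the file.
    """
    for chunk in chunks:
        if select_ids is None or chunk.get("stream_id") in select_ids:
            yield chunk

chunks = [
    {"stream_id": 1, "tag": 3},
    {"stream_id": 2, "tag": 3},
    {"stream_id": 1, "tag": 6},
]
kept = list(iter_selected(chunks, select_ids={1}))  # stream 2 is skipped
```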

…retain stream-headers-only for resolve_streams.
* main load func uses shared _read_chunks
* support stream_headers_only
* support select_streams on first pass
@cbrnr
Contributor

cbrnr commented Dec 30, 2020

> Am I right in assuming that resolve_streams only needs stream headers? Indeed the functions it calls (parse_xdf, parse_chunks, _read_chunks) only operate on stream headers so the answer seems like an obvious "yes", but those functions' docstrings don't state this explicitly so I was just wondering if you were planning on building these out in the future?

@cboulay yes, to implement select_streams I was lazy and used parts of existing code that I'd written in an effort to write an XDF reader from scratch. Therefore, the functions you mention only need stream headers, but they are also intended to be used for reading data. At that time I thought it was more important to ship the select streams feature than keeping the code base free from duplicated code.

> In my deduplication effort, I'm reusing your _read_chunks method, but this means adding a lot of other functionality to it, only to disable that functionality with stream_headers_only=True when being called via parse_xdf (via resolve_streams). Is this compatible with your use-case?

I need parse_chunks and parse_xdf - so yes, these changes should be compatible. In any case, I will test if everything still works with these changes, just let me know when this is ready for review.

@cboulay
Contributor Author

cboulay commented Dec 30, 2020

@cbrnr , ah, I was operating under the assumption that you only needed headers, so I renamed parse_chunks to parse_stream_header_chunks to be explicit. Sorry about that. We can rename it back.

But, we should discuss a little about whether or not that function is required at all. After the changes in this PR, there remains a difference in stream header formatting between load_xdf and your functions. Namely, load_xdf's stream header dict has all of its values wrapped in a list, even the simple strings (e.g. host) and single numbers (e.g. srate). This is because of the use of defaultdict in _xml2dict. I'm not a big fan of this, but changing this now would break any downstream applications that use pyxdf, so it's there to stay.
Do you absolutely need to have your stream headers "flattened" or "delistified"?

On that same train of thought, if you do build out parse_chunks further, and eventually have it handle data and build concatenated arrays, would you be satisfied if the format of the completed streams was the same as that currently returned by load_xdf? If yes, then maybe we can move the stream-building code out of load_xdf and into parse_chunks; parse_chunks could then be used by load_xdf, with an added kwarg to not flatten the stream header dict values.
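The list-wrapping behavior described above can be illustrated with a small sketch: an _xml2dict-style converter appends every child element's value into a defaultdict(list), so even scalar fields like name or srate come back as one-element lists. The delistify helper below is hypothetical, just to show what "flattening" would mean.

```python
# Sketch of why header values come back list-wrapped. This mimics the
# defaultdict(list) pattern described in the comment; it is not pyxdf's
# actual _xml2dict implementation.
from collections import defaultdict
import xml.etree.ElementTree as ET

def xml2dict(elem):
    d = defaultdict(list)
    for child in elem:
        # Every value is appended, so scalars become one-element lists.
        d[child.tag].append(xml2dict(child) if len(child) else child.text)
    return dict(d)

def delistify(d):
    # Hypothetical helper: unwrap one-element lists for convenience.
    return {k: v[0] if isinstance(v, list) and len(v) == 1 else v
            for k, v in d.items()}

info = ET.fromstring("<info><name>EEG</name><srate>500</srate></info>")
header = xml2dict(info)   # {'name': ['EEG'], 'srate': ['500']}
flat = delistify(header)  # {'name': 'EEG', 'srate': '500'}
```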

Please take a look at what I have in place now and let me know how you would like to proceed.

I think we should leave fixing up the 'verbose' usage until a later issue.

@cbrnr
Contributor

cbrnr commented Dec 30, 2020

@cboulay I will need to dig a little bit into my code to see what I really need. This will take some time, and since I'm technically on vacation right now it would be great if you could give me two weeks or so to check on this. For now, I think that streamlining the code base should take precedence over a specific use case I might have. After all, I can always pull out the functions I need into my other projects. But if possible, let's discuss this next year 😄 (i.e. in ~2 weeks).

@cboulay
Contributor Author

cboulay commented Dec 30, 2020

Of course! We need the headers-only feature sooner rather than later, but we're happy to use the branch for now.
Happy New Year 🥳

@agricolab
Member

agricolab commented Jan 2, 2021

Generally speaking, I am totally OK with rewriting my code to accommodate this PR and think it is worthwhile.

I just have a few questions. One of the tests for my liesl toolbox failed, for a functionality where I am using parse_chunks to inspect xdf files from the CLI. That is easy to accommodate with a try/except that imports parse_stream_header_chunks instead. But then it complains that there is no source_id anymore in the output of the new parse_stream_header_chunks. I can dig a little deeper into the code base, but if you can tell me quickly what the process would be to recover the output of the old parse_chunks with the new functions, that'd save me some time and brain juice. Thanks!

@cboulay
Contributor Author

cboulay commented Jan 2, 2021

@agricolab - Based on what Clemens said, I think I should rename parse_stream_header_chunks back to parse_chunks, so you don't need to worry about that.

Let me explain the organization as I see it before we decide what to do.

Path 1 - using load_xdf:

  1. open_xdf unzips if necessary, reads and checks the magic code
  2. _read_chunks generator function yields a dict per chunk.
    i. Error checking
    ii. Basic processing to put the data into the dict (e.g., _xml2dict, _normalize_header, _read_chunk3, using StreamData to track timestamps).
    iii. kwargs to control select_streams, stream_headers_only
    * select_streams uses match_streaminfos with only a single stream.
  3. Yielded chunks are further processed in load_xdf to build up temp_stream_data.
  4. Optional synchronize clocks
  5. Optional dejitter

Path 2 - using resolve_streams

  1. parse_xdf
  2. open_xdf as above
  3. _read_chunks as above
  4. parse_chunks

Note that we should not change the format returned by Path1 as this would break 3rd party applications we don't control.

Remaining things to change:

  • Move the "extra" processing that happens in load_xdf to a function named parse_chunk
  • Use parse_chunk inside both load_xdf and parse_chunks
  • As much as we can, change the data formats in Path 2 so it's similar to that returned by Path 1.
    • Where they cannot be changed, modify in Path 2.
    • skip_desc can probably be removed because I don't think Clemens actually needs to exclude desc from the info.
  • Somewhere along the way I guess I lost 'source_id', or at least it's not in the same location as previously. Which is correct?

After the above changes, the main difference between Path 1 and Path 2 is that Path 2 loads all the chunks before parsing them, whereas Path 1 parses them as they come in and aggregates them after they've all been parsed. Path 1 also has optional timestamp fixes.

With select_streams and stream_headers_only, there's really not much of a performance difference between the two paths.
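The _read_chunks generator at the heart of both paths can be sketched against the XDF chunk layout (a 1-byte length-of-length field, a little-endian length covering the 2-byte tag plus payload, then the payload). This is a deliberately simplified sketch, assuming that layout; the real pyxdf generator additionally does XML parsing, clock-offset collection, and sample decoding.

```python
# Rough sketch of a _read_chunks-style generator, assuming the XDF
# chunk layout described above. Tags 1 (FileHeader) and 2
# (StreamHeader) are the header chunks; everything else is skipped
# when stream_headers_only is set.
import io
import struct

def read_chunks(f, stream_headers_only=False):
    while True:
        nbytes = f.read(1)
        if not nbytes:
            return  # clean end of file
        (n,) = struct.unpack("B", nbytes)          # length-of-length
        length = int.from_bytes(f.read(n), "little")
        (tag,) = struct.unpack("<H", f.read(2))    # 2-byte chunk tag
        payload = f.read(length - 2)               # length includes the tag
        if stream_headers_only and tag not in (1, 2):
            continue  # the real code would seek past instead of reading
        yield {"tag": tag, "payload": payload}

# Build a tiny in-memory "file" with two chunks: a StreamHeader (tag 2)
# and a Samples chunk (tag 3).
buf = io.BytesIO()
for tag, payload in [(2, b"<info/>"), (3, b"\x00\x01")]:
    body = struct.pack("<H", tag) + payload
    buf.write(struct.pack("B", 1) + struct.pack("B", len(body)) + body)
buf.seek(0)
chunks = list(read_chunks(buf, stream_headers_only=True))
```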

@tstenner
Contributor

tstenner commented Jan 4, 2021

I haven't tested the PR yet, but from what I've read I'm all for it.

I have only two things to add:

  • _read_chunks should yield classes (i.e. an Iterable[Chunk] with Chunk as below) so it's possible to add type info for IDEs and Cython. These could be shared with a not-yet-existing Python XDF writer:
class Chunk:
    tag: int
    header_len: int
    payload_len: int
  • _normalize_header discards all fields not listed and is quite verbose. The default fields that don't need type conversion could be added with a dict comprehension, i.e.
fields = ['name', 'type']
res = {key: header.get(key, [None]) for key in fields}
res['channel_count'] = [int(header['channel_count'][0])]
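Putting the two suggestions together, a runnable version might look like this; the field names are taken from the snippet above and are illustrative rather than pyxdf's definitive set.

```python
# Runnable take on the two suggestions: a typed chunk container plus
# dict-comprehension header normalization. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class Chunk:
    tag: int
    header_len: int
    payload_len: int

def normalize_header(header):
    # Fields that need no type conversion default to [None] if absent.
    fields = ["name", "type"]
    res = {key: header.get(key, [None]) for key in fields}
    # Fields needing conversion are handled explicitly.
    res["channel_count"] = [int(header["channel_count"][0])]
    return res

hdr = normalize_header({"name": ["EEG"], "channel_count": ["8"]})
```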

@cboulay
Contributor Author

cboulay commented Mar 31, 2021

@cbrnr Have you had a chance to look at this? Can you please take a look and make sure that your use-cases work or suggest how to make it work?

@cbrnr
Contributor

cbrnr commented Apr 1, 2021

@cboulay sorry, I completely forgot. I will take a look today.

@cbrnr
Contributor

cbrnr commented Apr 1, 2021

@cboulay can you maybe rename the function back to parse_chunks? Otherwise, I immediately run into an error because I need this function. After that, testing will be a bit easier for me.

@cboulay
Contributor Author

cboulay commented Apr 1, 2021

I’m unavailable today. Please go ahead and make that change on your end and I’ll resolve later.

@cbrnr
Contributor

cbrnr commented Apr 1, 2021

Alright, all of my files still work with this PR. In addition to the comments already mentioned, I suggest that we make open_xdf a private function (i.e. rename it to _open_xdf). Furthermore, the public API is currently a bit confusing with the following functions (I'm already excluding open_xdf):

  • load_xdf: the main function which loads an XDF file
  • match_streaminfos: find stream IDs matching specified criteria
  • resolve_streams: resolve streams in given XDF file, i.e. return information on all streams
  • parse_xdf
  • parse_chunks

I'm using parse_chunks(parse_xdf(fname)) only to extract stream infos, so we could maybe simply expose one function that returns the composition of the two. Not sure if anyone else is using one or both of these functions separately though. I'm also wondering if this could be completely replaced by resolve_streams - I'll have to take a look.
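The composition described above could be exposed as a single public helper; this is only a sketch of the idea, with stand-in placeholder functions rather than the real pyxdf ones, and the name stream_infos is hypothetical.

```python
# Sketch of wrapping parse_chunks(parse_xdf(fname)) behind one public
# function. Both helpers below are placeholders standing in for the
# real pyxdf implementations.
def parse_xdf(fname):
    # Placeholder: would open the file and return all of its chunks.
    return [{"stream_id": 1, "name": ["EEG"]}]

def parse_chunks(chunks):
    # Placeholder: would extract per-stream info dicts from the chunks.
    return [{"stream_id": c["stream_id"]} for c in chunks]

def stream_infos(fname):
    """Single public entry point combining the two steps."""
    return parse_chunks(parse_xdf(fname))
```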

@cboulay could you compile a new to do list summarizing all the points mentioned in this thread? It doesn't mean that you have to implement all of the suggestions, but I'd like to discuss all points with everyone.

Contributor

@chkothe chkothe left a comment


Looks good to me!

@cbrnr
Contributor

cbrnr commented May 4, 2021

BTW, test_load_from_memory fails with a KeyError because testfile does not have an "info" field. This test works on master.
