
ENH: sample chunk type optimized for vectorized reading #19

Open · tstenner opened this issue Jul 18, 2018 · 25 comments
tstenner commented Jul 18, 2018

Current state of chunk 3

The current sample chunk type (tag 3) requires at least one branch per sample (two for string samples), so both the Python and Matlab implementations need compiled extensions to read it quickly.

Also, the current code to "compress" timestamps isn't helping much due to the usual floating point equality problems, even under perfect conditions:

In [1]: import numpy as np

In [2]: ts = np.arange(151500, 151600, 1/1000)

In [3]: np.sum(np.diff(ts)==1/1000)
Out[3]: 0

In [4]: np.diff(ts)
Out[4]: array([0.001, 0.001, 0.001, ..., 0.001, 0.001, 0.001])

In [5]: np.diff(ts)-1/1000
Out[5]: 
array([-1.07102096e-11, -1.07102096e-11, -1.07102096e-11, ...,
       -1.07102096e-11, -1.07102096e-11, -1.07102096e-11])

In [6]: np.mean(np.diff(np.arange(0, 10, 1/100))==1/100)
Out[6]: 0.002 # i.e. only 0.2% of the timestamps could actually be left out

Proposal: new chunk type

My proposal for a new chunk type therefore looks as follows:

[Tag 7, uint16] [streamid, uint32] [n_samples uint32] [n_channels uint32] [typeid uint8] # chunk header
[timestamps, n_samples * double64]                    # timestamps, 0: left out
[samples, n_channels * n_samples * channel_type]      # sample data

Each chunk would require at most 3 reads:

import struct
import numpy as np

streamid, n_samples, n_channels, typeid = struct.unpack('<IIIB', f.read(13))
stream = stream_info[streamid]  # dtype and bytes per sample known from the stream header
timestamps = np.frombuffer(f.read(n_samples * 8), np.float64)
data = np.frombuffer(f.read(n_samples * stream.samplebytes), stream.dtype).reshape(n_samples, n_channels)

The chunk includes sample count, channel count and data type for increased robustness (errors can be detected earlier) and so data can be read without an XML parser.

For the timestamp array, a value of 0 (0x0000000000000000 in IEEE 754 representation) indicates a left-out timestamp that would be recomputed as usual. Since this chunk format is more compression-friendly due to the reduced number of reads / writes, a gzip compressed array would also require less space.
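
For illustration, a minimal sketch (Python/numpy) of how a reader could recompute the left-out timestamps afterwards; it assumes gaps are filled from the last written stamp at the stream's nominal rate, and the function and argument names are illustrative, not part of the proposal:

import numpy as np

def fill_timestamps(ts, nominal_srate):
    # Replace left-out (zero) entries with the last written timestamp
    # plus the elapsed time at the nominal sampling rate.
    # Assumes the first timestamp of the chunk was written.
    ts = ts.copy()
    idx = np.arange(len(ts))
    written = ts != 0.0
    last = np.maximum.accumulate(np.where(written, idx, 0))  # most recent written index
    ts[~written] = ts[last[~written]] + (idx[~written] - last[~written]) / nominal_srate
    return ts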

Implementation

The writer is implemented in https://github.com/labstreaminglayer/App-LabRecorder/tree/xdfwriter_chunk7. The main serialization code looks as follows:

_write_chunk_header(chunk_tag_t::samples_v2, len, &streamid); // tag 7, chunk length, stream id
write_little_endian(file_, n_channels);
write_little_endian(file_, n_samples);
write_little_endian(file_, sampletype);
write_sample_values(file_, timestamps);                       // n_samples doubles, 0 = left out
write_sample_values(file_, chunk, n_samples * n_channels);    // raw sample values

The reader is implemented in https://github.com/tstenner/xdf/commit/412998f3ed66cc7e85a93d77a1d7c96f81073ff0

Benchmark (more below)

A single XDF file with one 64-channel EEG (double) stream was created with 30 chunks, each containing 10K samples.

          Reading (xdfz)   Reading (xdf)
Chunk 3:  3.094 s          1.704 s
Chunk 7:  0.961 s          0.399 s

In both cases, the new chunks are slightly smaller (chunk 3 vs. chunk 7: 156.3 vs. 156.0 MiB raw, 36.0 vs. 34.9 MiB compressed with gzip -6).

Open questions

- [ ] layout for string chunks
- [ ] optional padding so data starts at page boundaries (e.g. for mmap)?
- [ ] should the redundant type index be included?


tstenner commented Oct 8, 2018

CC @chkothe who wanted to see an example implementation.

tstenner commented:

Also CC @agricolab, @cbrnr since this might be useful for a pure python reader.


cbrnr commented Apr 2, 2019

I'm not sure a new chunk type is the best solution. Have you tried Cythonizing the existing type 3 reader?


tstenner commented Apr 8, 2019

I've experimented with both Cython and pybind11 (C++ to parse the data chunks), and while the speedups were impressive, both were still slower (not to mention more complex) than the pure Python chunk 7 implementation sketched above. I'll upload the test code and data files some time next week.

With liblsl we've had our fair share of binary distribution problems (both for Python and Matlab), so a chunk type that enables both fast loading with the base language as well as the option to mmap (parts of) streams into memory would allow for a more portable use of the format, not to mention that the additional fields allow for a lot more error checking than currently possible.


cboulay commented Apr 8, 2019

There’s a lot of concern about feature parity with the matlab importer.

Since it is still early stages, libxdf can be created in a way that keeps it manylinux2010 compatible.


cboulay commented Apr 8, 2019

Neither of my comments above are reasons in themselves to not change the chunk format. As long as things stay backwards compatible then I am for it.


tstenner commented Apr 9, 2019

From my point of view, it's absolutely necessary that all three readers (C++ libxdf, xdf-Python and xdf-Matlab) have a reference implementation before anything is set in stone. For Matlab, the fread function should make the chunk reader code about as simple and fast as the python code I posted above.


tstenner commented Apr 12, 2019

Testing with 10 chunks, 64 channels and 100000 samples each. You can find the code at https://github.com/tstenner/xdf/tree/chunk7 (readers) and https://github.com/labstreaminglayer/App-LabRecorder/tree/xdfwriter_chunk7 (writer; run test_chunk7 as test_chunk7 10 100000 0 0).

Times in seconds:

                    64ch floats   strings
Python   plain      7.10
         C++ ext.   2.67
         chunk7     1.64
Matlab   plain      17.66         28.4
         mex        1.32          0.62
         chunk7     0.69          8.6   1.61

So even though compiled extensions speed everything up considerably, the chunk 7 readers built on the languages' own binary reading routines (fread, np.frombuffer) are even faster.

> Neither of my comments above are reasons in themselves to not change the chunk format. As long as things stay backwards compatible then I am for it.

Most readers either print an error message about the new chunk type or skip the chunks, but a conversion utility could help in the immediate transition. The (as far as I can see) feature-complete Matlab reader for chunk 7 is only 37 lines long, so it's quite easy to implement.

tstenner commented:

Added data for the Matlab string chunk reader


cbrnr commented Apr 16, 2019

Can you summarize again why chunk3 reading is slow? I didn't really understand it from your initial post. Is it the nested nature of the samples structure (i.e. that you can't read the whole chunk in one go because you don't know how many bytes it consists of, mainly because of the presence/absence of timestamps)? Can LSL outlets stream chunk7 data? Is there more buffering going on for chunk7 vs. chunk3? What are the pros and cons of both chunk types?

Having said that, XDF is extensible as the name implies, so if there are good reasons why we need something in addition to chunk3 then we should go for it. Personally, I could live with 7 seconds loading time of the plain Python chunk3 reader, but I guess there will be much longer files.

The compiled extensions (Cython/MEX) are already much faster, so it is worth discussing if these are already fast enough. It is a bit odd that MEX is twice as fast as Cython, so I'm assuming that the Cython implementation could be further tweaked.

Furthermore, why is the MATLAB chunk7 reader more than twice as fast as the Python implementation?


dmedine commented Apr 16, 2019

Just a comment on the speed: Matlab has very well optimized matrix handling and is always working to improve performance there, so it's probably always going to be better than Python's. As suggested in this post: https://stackoverflow.com/questions/2133031/is-matlab-faster-than-python NumPy can be compiled against ATLAS, which will help. Don't know why it isn't distributed this way by default.

Still, 2x faster seems like a lot.

I still think that having a(n optimized) C backend is the best option here. For my money, what's lost in ease of distribution is gained in maintenance and performance. But, I'm not doing any work on this, so my vote doesn't count for much :P


cbrnr commented Apr 16, 2019

> Just a comment on the speed: Matlab has very well optimized matrix handling and is always working to improve performance there, so it's probably always going to be better than Python's. As suggested in this post: https://stackoverflow.com/questions/2133031/is-matlab-faster-than-python NumPy can be compiled against ATLAS, which will help. Don't know why it isn't distributed this way by default.

NumPy is also heavily optimized, and they even use the same LAPACK/BLAS libraries in the background. Anaconda links NumPy against Intel MKL, which speeds things up even more (in most use cases). So like you're saying, I'm a bit skeptical: when MATLAB is 2x faster than Python/NumPy, I tend to think that the Python code could probably be improved (and maybe it is the code outside of NumPy that could be optimized).

> I still think that having a(n optimized) C backend is the best option here. For my money, what's lost in ease of distribution is gained in maintenance and performance. But, I'm not doing any work on this, so my vote doesn't count for much :P

Distributing C bindings is a lot of work for Python, because you need to create binary wheels for all platforms if you want pip install to just work. That's why I normally prefer pure Python or if necessary Cython. But of course then you need to write separate readers for all languages (like we currently do). So yes, having only one C reader with bindings to other languages would be a good option (LSL does that BTW) - but currently we don't have anyone with the time to work on that.


dmedine commented Apr 16, 2019

Chad managed to get pylsl working quite well on pip, but I gather it was no small effort and there have been a few hiccups along the way (e.g. certain versions not working on certain operating systems). Now it is running quite nicely. The way to do it properly would be to integrate CI the same way liblsl does to automate the process. This is a huge amount of work. The (rather obvious) question is whether or not it is less work than maintaining two different languages in parallel.

Again, I'm just talking here. I think I've contributed exactly 1 line of code to load_xdf.m over the years.


cbrnr commented Apr 16, 2019

> Chad managed to get pylsl working quite well on pip, but I gather it was no small effort and there have been a few hiccups along the way (e.g. certain versions not working on certain operating systems). Now it is running quite nicely. The way to do it properly would be to integrate CI the same way liblsl does to automate the process. This is a huge amount of work. The (rather obvious) question is whether or not it is less work than maintaining two different languages in parallel.

Yes, that's the question, and like you I'm also just talking here 😄. I can't answer this question, because I don't know the pyxdf roadmap nor who is currently actively involved. I'm planning to add XDF support to MNELAB, that's why I'm interested in the current state of development. And for my purposes, I'd very much prefer pure Python code (unless the pip package is working flawlessly like pylsl, then I don't care).

tstenner commented:

> Can you summarize again why chunk3 reading is slow?

The use of variable-length ints (lenlength = f.read(1); len = f.read(lenlength); value = f.read(len); for timestamps: if f.read(1): ts = f.read(8)) means that the data has to be read in tiny portions that depend on each other. Buffered reads can alleviate some of that pain, but it's still hard to properly optimize.
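
For illustration, a sketch of the per-sample loop a plain Python chunk 3 reader ends up with (simplified to one numeric stream; the helper and parameter names are mine, the byte layout follows the XDF 1.0 spec):

import struct

def read_varlen_int(f):
    # XDF variable-length integer: one byte length-of-length, then the value
    nbytes = ord(f.read(1))
    return int.from_bytes(f.read(nbytes), 'little')

def read_chunk3_payload(f, n_channels, channel_fmt='d', itemsize=8):
    n_samples = read_varlen_int(f)
    timestamps, samples = [], []
    for _ in range(n_samples):
        if ord(f.read(1)) == 8:    # branch per sample: is a timestamp present?
            ts, = struct.unpack('<d', f.read(8))
        else:                      # left out, deduced later
            ts = None
        timestamps.append(ts)
        samples.append(struct.unpack('<%d%s' % (n_channels, channel_fmt),
                                     f.read(n_channels * itemsize)))
    return timestamps, samples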

> Can LSL outlets stream chunk7 data?

The chunk format is independent of the stream data, i.e. the same data could be saved as either chunk3 or chunk7 in the xdf file. The only difference between the chunk types is the metadata at the chunk beginning and the layout of the data.

> Personally, I could live with 7 seconds loading time of the plain Python chunk3 reader, but I guess there will be much longer files.

I hadn't talked about it here before, but in theory there could be an xdf optimization tool that reorders and merges chunks so the data could be mmap()ed (and/or read only when needed for mne-python).

> The compiled extensions (Cython/MEX) are already much faster, so it is worth discussing if these are already fast enough.

Speed is only half the problem – while playing with the format I've caused several crashes without even trying, and Matlab/Python handle corrupt files better (i.e. an exception instead of a segfault). Compiled extensions also need to be compiled and distributed, and I've had more fun with that than I'd have liked.

> It is a bit odd that MEX is twice as fast as Cython, so I'm assuming that the Cython implementation could be further tweaked.

Could be several things. The C++ extension is a minimal prototype I didn't optimize for speed, and I think it copies the data more often than needed. But even with the optimized mex file, the chunk3 reader is slower than a pure Matlab implementation for chunk7.

> Furthermore, why is the MATLAB chunk7 reader more than twice as fast as the Python implementation?

Good question, but I suppose it's because Matlab has an advantage with the JIT compiler for everything except the read operation.

> what's lost in ease of distribution is gained in maintenance and performance

For Matlab, the chunk7 reader has ~37 lines of code, all of which can be debugged/profiled in Matlab and if something goes wrong the user gets an error message with an explanation and the line number.
The mex function is ~160 lines of C, slower, harder to debug and, according to Mathworks, should be compiled for each Matlab version.

> Now it is running quite nicely.

For Windows and OS X, yes, but for Linux we'd need to build against an old CentOS version that's hard to get nowadays.

> The (rather obvious) question is whether or not it is less work than maintaining two different languages in parallel.

Based on the previous community contributions, most users know either Matlab or Python and a few know C / C++. I would trust most of them to tweak, test and distribute Matlab or Python code, but with C++ (especially with constraints w.r.t. distribution and cross-platform compatibility) it's going to be a lot harder.
The other problem is the API/ABI: we couldn't expose C++ directly, and I think three readers (Matlab, Python and C++) are easier to maintain than one C++ reader with a C ABI and two language bindings.


cboulay commented Apr 16, 2019

@tstenner There's some sort of ANSI standard associated with the XDF specification. I don't think the research community has to be locked into that specification forever, but I'm pretty sure the only chance a change to the file format will be accepted upstream is if all the major tools that use XDF are compatible with both formats in a completely invisible way, and that precludes using a converter tool. So the Python and Matlab importers will both have to be compatible with both formats, and LabRecorder should have an option to save in the old format (I think it's OK to default to the new format).

Side note: Using a converter is not a good option for some people (like me) who have clinical data. The raw data is sacred and can't be modified. Using a converter simply doubles the amount of disk space required and breaks the load-preprocess-save_to_h5 sequence into a slow step and a fast step, instead of one slow step. However, I'm sure a converter tool would be useful to many people, though I think that could be put off until all of the importers are updated and accepted upstream.

@cbrnr I have large multi-GB files and the current import times are quite painful.


cbrnr commented Apr 16, 2019

@cboulay a new chunk format won't speed up loading of existing chunk3 files, so an optimized chunk3 reader would probably be a good idea regardless of how we handle the proposed chunk7 type.

I don't know about any official standardization of XDF - do you have more details on that? The way I imagined the process of extending the XDF specification is that the community decides that e.g. chunk7 is necessary and this will be integrated into XDF 1.1 (currently we have XDF 1.0). Therefore, readers compatible with XDF 1.0 don't have to be compatible with the new extension, but due to the XDF format, they can just skip unknown chunk types.

@tstenner thanks for the summary. What is the disadvantage of chunk7? Do outlets need to buffer more or do more processing before pushing out samples in this format? Is the chunk7 type less flexible than chunk3?

Regarding compiled extensions, I think this is relatively painless if you use Cython. As I've mentioned before, I think we should try to optimize chunk3 reading regardless of our chunk7 discussion.

tstenner commented:

> the only chance a change to the file format will be accepted upstream is if all the major tools that use XDF are compatible with both formats in a completely invisible way

I'm already on it :-)
The current process from my end is:

  1. write the specification
  2. reader support in the Python, Matlab and C++ reader (sigviewer)
  3. writer support in the LabRecorder xdfwriter
  4. back to 1. until everyone's happy
  5. get everyone to adopt the new reader libraries
  6. add the new chunk format as an option in LabRecorder + a conversion tool
  7. some time later: change the default in LabRecorder

> Side note: Using a converter is not a good option for some people (like me) who have clinical data.

Agreed, but users with clinical data are (hopefully) more technically inclined and have more resources so an additional compilation step is doable for them. Also, the chunk7 format is slightly more robust to errors (wrong offsets) so there's less risk of unreadable files.

> What is the disadvantage of chunk7?

So far nobody (apart from me) uses it :-)
In some very narrow cases (see example above) the timestamps might need more space, but in practice this rarely happens. The buffer sizes are roughly the same - you can think of chunk7 as chunk3 with some bits reordered. (In theory, one could allocate space in an XDF file and write the timestamps and data as they come in, but I'm not sure it's worth it). Bigger chunks are a good idea with chunk7, but not necessary.

> Regarding compiled extensions, I think this is relatively painless if you use Cython.

I took a shot at it and found it difficult to keep the normal implementation and the Cython annotations separate so it would run without Cython, whereas pybind11 was relatively easy to keep separate (and optional). For widespread usage, I agree that Cython is a better idea.


cbrnr commented Apr 17, 2019

> So far nobody (apart from me) uses it :-)
> In some very narrow cases (see example above) the timestamps might need more space, but in practice this rarely happens. The buffer sizes are roughly the same - you can think of chunk7 as chunk3 with some bits reordered. (In theory, one could allocate space in an XDF file and write the timestamps and data as they come in, but I'm not sure it's worth it). Bigger chunks are a good idea with chunk7, but not necessary.

One major difference though is that in chunk7, you need to supply either all timestamps (for each sample), or none. In chunk3, each sample can have a timestamp or not. Can you think of a recording setup where this might be a problem? Or does a chunk7 writer automatically timestamp all samples according to the nominal sampling rate? How many samples would you typically write in a chunk7? AFAIK the main idea of chunk3 was to provide timestamps when the data are actually sent out, and this is not possible with every sample because most devices push out an array of samples at once.

I took a shot at it and found it difficult to keep the normal implementation and the Cython annotations separate so it would run without Cython, whereas pybind11 was relatively easy to keep separate (and optional). For widespread usage, I agree that Cython is a better idea.

I've never used pybind11, but yes, when you cythonize a module you can't easily keep a pure Python implementation simultaneously. I don't think that's even necessary (maybe just as a fallback), it's sufficient if we have a nice Cython implementation.

tstenner commented:

> in chunk7, you need to supply either all timestamps (for each sample), or none

As in chunk3, timestamps can also be left out in chunk7 (i.e. all 8 bytes are 0).
With more than 12.5% of timestamps left out, chunk3's timestamps take less space (a written chunk3 timestamp costs 9 bytes and a left-out one 1 byte, versus a constant 8 bytes in chunk7), but even under perfect conditions this doesn't happen because of floating point inaccuracies (see the first code example), and leaving timestamps out makes it impossible to jump to a specific timestamp or to read everything in one vectorized operation.

So in chunk7, the chunk3 timestamp sequence [uint8 1] [double64 100.5] [uint8 1] [double64 100.6] [uint8 0] [uint8 1] [double64 100.8] would be [double64 100.5] [double64 100.6] [double64 0] [double64 100.8].
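
In bytes, a quick way to compare the two encodings for that sequence (Python struct, purely illustrative):

import struct

chunk3 = b''.join(
    struct.pack('<Bd', 8, ts) if ts else b'\x00'      # 1+8 bytes written, 1 byte left out
    for ts in (100.5, 100.6, None, 100.8))
chunk7 = struct.pack('<4d', 100.5, 100.6, 0.0, 100.8)  # constant 8 bytes each
print(len(chunk3), len(chunk7))  # 28 vs. 32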

For high sampling rates, I could see the point of writing only a small subset of timestamps, but in my practical experience almost all timestamps get written anyway. For those cases, maybe a sparse representation ([uint32 num_written_ts] [uint32*num_written_ts index] [double64*num_written_ts timestamps]) would be better? With less than 50% of timestamps written this would be smaller, but it would take a lot more left-out timestamps for it to be faster and worth it.
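
To give an idea of the complexity, a sketch of what reading such a sparse block could look like (the layout is the one just proposed; everything else is hypothetical):

import numpy as np

def read_sparse_timestamps(f, n_samples):
    # [uint32 num_written_ts] [uint32*num_written_ts index] [double64*num_written_ts timestamps]
    n_written = int(np.frombuffer(f.read(4), np.uint32)[0])
    index = np.frombuffer(f.read(4 * n_written), np.uint32)
    written = np.frombuffer(f.read(8 * n_written), np.float64)
    ts = np.zeros(n_samples)   # 0 still means "left out"
    ts[index] = written
    return ts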

I was hoping for some experts (@cboulay?) to chime in on whether their experimental data benefits from having only a few timestamps saved.

> when you cythonize a module you can't easily keep a pure Python implementation simultaneously

The C++ implementation only has one function for each data type that reads in the raw chunk data and populates the timestamp and data numpy arrays. On module load, the reader tries to import these functions.

We could factor out the chunk reader functions and, on module load, try to load sped-up replacements (either pybind11 or Cython) and monkeypatch the included functions.
Some parts of the code would be duplicated, but the user would have the choice between the dependency-free Python code paths and the fast Cython replacements.
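
A sketch of that pattern, with hypothetical module and function names:

def _read_chunk7_py(f, stream):
    # dependency-free pure Python reader, always available
    raise NotImplementedError  # placeholder for the code sketched earlier

read_chunk7 = _read_chunk7_py

# on module load, swap in a compiled replacement if one can be imported
try:
    from _xdf_fast import read_chunk7  # hypothetical Cython/pybind11 module
except ImportError:
    pass  # keep the pure Python code path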

One other thing regarding disadvantages: most readers (and writers) support 64 bit lengths in theory, but fail in one way or another when they actually occur in a file, so I capped the sample and channel counts at 32 bits, i.e. 2^32-1. This will probably be quoted in the near future, but I think 4 billion samples per chunk are enough for everybody.


cbrnr commented Apr 17, 2019

> So in chunk7, the chunk3 timestamp sequence [uint8 1] [double64 100.5] [uint8 1] [double64 100.6] [uint8 0] [uint8 1] [double64 100.8] would be [double64 100.5] [double64 100.6] [double64 0] [double64 100.8].

Ah, now I get it. Left-out timestamps are represented as 0. I think this is good enough, I wouldn't be concerned about storage space. Practically speaking, if a device pushes out 1000 samples every second in a package, you could timestamp the first sample and have 999 zeros for the other samples. Correct?

> We could factor out the chunk reader functions and, on module load, try to load sped-up replacements (either pybind11 or Cython) and monkeypatch the included functions.
> Some parts of the code would be duplicated, but the user would have the choice between the dependency-free Python code paths and the fast Cython replacements.

I don't know what the best practice is here, but as long as it works and is maintainable I'm fine with this approach.

> One other thing regarding disadvantages: most readers (and writers) support 64 bit lengths in theory, but fail in one way or another when they actually occur in a file, so I capped the sample and channel counts at 32 bits, i.e. 2^32-1. This will probably be quoted in the near future, but I think 4 billion samples per chunk are enough for everybody.

For the near future this will probably be enough, but as you say, demand will probably catch up faster than you think. So if somehow possible, I'd go for 64 bits - you are talking about the two header fields n_channels and n_samples, right? You could use 64 bit fields in the chunk7 format even though current readers seem to have practical problems with such huge files; it doesn't hurt to have them in the specs.

tstenner commented:

> Practically speaking, if a device pushes out 1000 samples every second in a package, you could timestamp the first sample and have 999 zeros for the other samples. Correct?

Correct. A transparent compression algorithm (gzip? zstd?) would help with the file size issue, but so far nothing's been standardized in this matter.
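
Nothing standardized yet, but a quick sketch shows how well even plain gzip would do on a mostly-left-out timestamp array (hypothetical numbers, one written stamp per 1000-sample package):

import gzip
import numpy as np

n_samples, package = 100_000, 1000
ts = np.zeros(n_samples)                                   # 0 = left out
ts[::package] = 151500 + np.arange(n_samples // package)   # one stamp per package
raw = ts.tobytes()
print(len(raw), len(gzip.compress(raw)))                   # 800000 bytes raw vs. a few kB compressed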

> you are talking about the two header fields n_channels and n_samples, right?

Right. This is per-chunk, so even in the best case (2^32 samples, one uint8 channel) the chunk alone would contain at least 32 GiB of timestamps and 4 GiB of sample data. Or to put it another way: at a sample rate of 1 million samples/second this would still be over an hour of data. End users, especially scientists, are always full of surprises, but it should be a long while before they reach those amounts of data.


cbrnr commented Apr 17, 2019

> Right. This is per-chunk, so even in the best case (2^32 samples, one uint8 channel) the chunk alone would contain at least 32 GiB of timestamps and 4 GiB of sample data. Or to put it another way: at a sample rate of 1 million samples/second this would still be over an hour of data. End users, especially scientists, are always full of surprises, but it should be a long while before they reach those amounts of data.

Agreed, but can you still make these 2 fields 64bit? It surely doesn't hurt.

tstenner commented:

The channel count for LSL streams is already limited to 31 bits, so it wouldn't make sense for it to be larger in the xdf file.
I could bump the sample count to 64 bits, but chunks with more than 4 billion samples should probably be split up anyway.


cbrnr commented Apr 17, 2019

Alright, then leave them as 32bits.
