Slow parse_collection due to I/O synchronization #398
Labels: bug, enhancement, performance, priority:high
Describe the bug
Ingesting plaintext records via stdin is very slow (12 MiB/sec) even though 60 worker threads are used. I suspect this is related to std::ios::sync_with_stdio (https://en.cppreference.com/w/cpp/io/ios_base/sync_with_stdio).

Before I disabled synchronization, most threads were idle and I could only parse at 12 MiB/sec. After I disabled it, I was able to ingest at full speed (the speed of the decompressor) and all threads were utilized.

I'm not sure whether this is a safe thing to do, though.
To Reproduce
Steps to reproduce the behavior:
1. Run: find /storage/col-zstd/ -name '*.zst' -exec zstdcat {} \; | pv | ./parse_collection -o /storage/col_pisa_idx/forward -j 60 -f plaintext --stemmer krovetz -b 100000
2. Observe a throughput of 12 MiB/sec. perf top shows that the process is I/O bound, as most time is spent in libc::getc and libc::ungetc. Only 3 cores are utilized.
3. Add std::ios::sync_with_stdio(false); to parse_collection and observe that throughput rises to 350 MiB/sec (the zstd decompression speed) and all 60 cores are utilized.

Expected behavior
It is unclear whether this optimization is safe. It appears to be, since only one thread ever reads from stdin before work is passed on to the worker threads?
Environment info
Operating System: Ubuntu 20.04
Compiler: g++ (Ubuntu 9.3.0-10ubuntu2) 9.3.0