Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow parse_collection due to I/O synchronization #398

Open
mpetri opened this issue Jun 1, 2020 · 3 comments
Open

Slow parse_collection due to I/O synchronization #398

mpetri opened this issue Jun 1, 2020 · 3 comments
Assignees
Labels
bug Something isn't working enhancement New feature or request performance Efficiency improvements priority:high

Comments

@mpetri
Copy link

mpetri commented Jun 1, 2020

Describe the bug
Ingesting plaintext records via stdin seems very slow (12MiB/sec) even though 60 worker threads are used. I suspect this has to do with https://en.cppreference.com/w/cpp/io/ios_base/sync_with_stdio

Before I disabled synchronization most threads were idle and I could only parse with 12MiB/sec. After I disabled sync I was able to ingest with full speed (speed of the decompressor) and all threads were utilized.

Not sure if this is a safe thing to do though..

To Reproduce
Steps to reproduce the behavior:

  1. Ingest data in plain text format: find /storage/col-zstd/ -name '*.zst' -exec zstdcat {} \; | pv | ./parse_collection -o /storage/col_pisa_idx/forward -j 60 -f plaintext --stemmer krovetz -b 100000
  2. Observe speeds reporter by pv of 12 MiB/sec. perf top shows that the process is I/O bound as most time is spent in libc::getc and libc::ungetc. Only 3 cores are utilized.
  3. Add std::ios::sync_with_stdio(false); to parse_collection and observe speeds go to 350 MiB/sec (zstd decompression speed) and all 60 cores are utilized.

Expected behavior

Unclear if this optimization is safe. It appears to be as only one thread ever reads from stdin before work is passed to the different worker threads?

Environment info
Operating System: Ubuntu 20.04
Compiler: g++ (Ubuntu 9.3.0-10ubuntu2) 9.3.0

@mpetri mpetri added the bug Something isn't working label Jun 1, 2020
@JMMackenzie JMMackenzie added enhancement New feature or request performance Efficiency improvements labels Jun 1, 2020
@amallia amallia self-assigned this Jun 1, 2020
@elshize
Copy link
Member

elshize commented Jun 1, 2020

Thanks @mpetri for reporting. Yes, this should be totally safe. The input is read from stdin only by the main thread and then batches are delegated to the workers, and these write do different files, after which one thread merges them. So there should be no races.

If you notice any problems, let us know.

@mpetri
Copy link
Author

mpetri commented Jun 1, 2020

After creating an index with this enabled everything looked okay. So you are right it looks like it is safe. The main question would be where to enable this in the program? At the start or only at the beginning of the while loop in the forward_index_builder

@elshize
Copy link
Member

elshize commented Jun 1, 2020

I think we should do it in the parse_collection tool itself. Right now, it would probably not make any difference, but I'd rather not change global state in one of the algorithms, it might lead to some surprising bugs down the road.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request performance Efficiency improvements priority:high
Projects
None yet
Development

No branches or pull requests

4 participants