John on big/little architectures #5435
Comments
Hi @ukasz. Great to see you playing with John again. Of course, we do have this issue. There's no easy fix given our current architecture. Ideally, we'd let threads or processes run asynchronously, yet re-allocate work between them once in a while, or have them allocate candidate password batches from a shared pool dynamically. We need this not only for supporting big.LITTLE.
John's usage of OpenMP is straightforward. For parallel loops, OpenMP defaults to static scheduling, meaning each thread is allocated the same share of work, and then threads just wait (passively or actively, which you can control). OpenMP also supports dynamic scheduling, which is similar to what I wrote above in "let threads [...] run asynchronously, yet re-allocate work between them once in a while" (but at a lower level and a higher re-allocation frequency, and thus higher overhead than I'd prefer). You can try that via the schedule(dynamic) clause, or via the OMP_SCHEDULE environment variable when the loop uses schedule(runtime).
Hi @solardiz, I wanted to play with SHA-256 specifically because I noticed some improvement. Moving back to the scaling: do you know the reason behind the poor scaling of those formats? OpenMP still provides a much simpler programming interface than pthreads, so with larger keys-per-crypt values, and assuming threads are bound to specific CPUs, we could measure the performance of each thread between crypt_all calls (a moving average of the last few calls) to better split the work manually.
I took the SHA-NI topic to its own issue, #5437.
Yes, it's primarily that our candidate password generation is single-threaded and scalar, and executes only sequentially with the hashing. So e.g. for SHA-256 on AVX2, even with just 1 thread, we first sequentially generate at least 8 candidate passwords before each vectorized hash computation.

One way to mitigate this without a complete redesign would be to enhance our formats interface to support two sets of keys/hashes and alternate between them, so that candidate password generation can proceed concurrently with hashing of the previous set.

We also have a similar bottleneck with hash comparisons. Technically, we could parallelize those as well.

A solution is to implement built-in mask mode support in CPU formats like this. We currently have on-device mask mode support in some GPU and FPGA formats. This moves most candidate password generation and hash comparisons into the format's parallel processing. This is efficient, but it only works for mask mode (including when used on top of another mode), and it requires per-format code (or formats invoking shared code).
There's really no point OpenMP'ing this format as it currently is, on current CPUs. Ditto for other unsalted fast hash formats. For your experiments, I recommended that you use an iterated format. If you want to play specifically with SHA-256, then use, e.g., PBKDF2-HMAC-SHA256 and multiply the speed by its iteration count and by 2 (for HMAC) to obtain SHA-256 speed.
With OpenMP, both measuring the performance of each thread and splitting the work manually are tricky. While I can think of hacks to do that, OpenMP's simplicity is lost if we use those.
I ran some benchmarks on an Intel 13900, and I have a feeling that the way John spreads work among threads is not optimal on machines with two types of cores. It seems to assume that each set of hashes will take a similar amount of time to compute, which is not true when the two core types are this different. I suspect that the P cores finish and then wait at a synchronization point for the E cores. The performance difference is about 2x, so they wait for roughly half of the time.
I am not sure whether this is John's or OpenMP's fault, to be honest, without more digging, but I wanted to leave a note before I forget about it.
All cores:
OMP_NUM_THREADS=32 ./john -test -format=raw-sha256
Will run 32 OpenMP threads
Benchmarking: Raw-SHA256 [SHA256 256/256 AVX2 8x]... (32xOMP) DONE
Raw: 108770K c/s real, 3432K c/s virtual
P cores only - no HT (116K with HT)
OMP_PLACES='{0},{2},{4},{6},{8},{10},{12},{14}' OMP_NUM_THREADS=8 ./john -test -format=raw-sha256
Will run 8 OpenMP threads
Benchmarking: Raw-SHA256 [SHA256 256/256 AVX2 8x]... (8xOMP) DONE
Raw: 118947K c/s real, 14868K c/s virtual
E cores only:
OMP_PLACES='{16},{17},{18},{19},{20},{21},{22},{23},{24},{25},{26},{27},{28},{29},{30},{31}' OMP_NUM_THREADS=16 ./john -test -format=raw-sha256
Will run 16 OpenMP threads
Benchmarking: Raw-SHA256 [SHA256 256/256 AVX2 8x]... (16xOMP) DONE
Raw: 80871K c/s real, 5045K c/s virtual