Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

argon2-opencl fix for macOS #5420

Merged
merged 1 commit into from
Apr 6, 2024
Merged

Conversation

magnumripper
Copy link
Member

The Apple driver doesn't let us use inline assembler.

With this patch:

Device 3: AMD Radeon Pro Vega 20 Compute Engine
Benchmarking: argon2-opencl [Blake2 OpenCL]... |
Trying to compute 62 hashes at a time using 992 of 4080 MiB device memory
LWS=[64-192] GWS=[1984-5952] ([10-93] blocks) => Mode: LOCAL_MEMORY
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 4096, cost 3 (p) of 1, cost 4 (type [0:Argon2d 1:Argon2i 2:Argon2id]) of 0 and 1
Raw:	1099 c/s real, 15942 c/s virtual

There are other problems I haven't looked into yet, with other devices:

Device 1: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Testing: argon2-opencl [Blake2 OpenCL]... 0: OpenCL CL_INVALID_BUFFER_SIZE (-61) error in opencl_argon2_fmt_plug.c:608 - Error creating memory buffer
Device 2: Intel(R) UHD Graphics 630
Testing: argon2-opencl [Blake2 OpenCL]... Binary Build log: <program source>:68:7: warning: no previous prototype for function 'u64_shuffle'
ulong u64_shuffle(ulong v, uint thread_src, uint thread, __local ulong *buf)
      ^
<program source>:115:7: warning: no previous prototype for function 'block_th_get'
ulong block_th_get(const struct block_th *b, uint idx)
      ^
<program source>:125:6: warning: no previous prototype for function 'block_th_set'
void block_th_set(struct block_th *b, uint idx, ulong v)
     ^
<program source>:133:7: warning: no previous prototype for function 'mul_wide_u32'
ulong mul_wide_u32(ulong a, ulong b)
      ^
<program source>:149:6: warning: no previous prototype for function 'g'
void g(struct block_th *block)
     ^
<program source>:171:6: warning: no previous prototype for function 'apply_shuffle_shift2'
uint apply_shuffle_shift2(uint thread, uint idx)
     ^
<program source>:182:6: warning: no previous prototype for function 'shuffle_block'
void shuffle_block(struct block_th *block, uint thread, __local ulong *buf)
     ^
<program source>:260:6: warning: no previous prototype for function 'next_addresses'
void next_addresses(struct block_th *addr, uint thread_input, uint thread, __local ulong *buf)
     ^
<program source>:295:6: warning: no previous prototype for function 'blake2b_block'
void blake2b_block(ulong m[16], ulong out_len, ulong message_size)
     ^

FAILED (cmp_one(10))

The Apple driver doesn't let us use inline assembler.
@magnumripper magnumripper requested a review from solardiz January 8, 2024 01:23
Copy link
Member

@solardiz solardiz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this is reasonable, but I think changes are needed - made comments.

@@ -68,7 +68,7 @@ ulong u64_shuffle_warp(ulong v, uint thread_src)
ulong u64_shuffle(ulong v, uint thread_src, uint thread, __local ulong *buf)
#endif
{
#if USE_WARP_SHUFFLE && gpu_nvidia(DEVICE_INFO) && SM_MAJOR >= 3
#if USE_WARP_SHUFFLE && !__OS_X__ && gpu_nvidia(DEVICE_INFO) && SM_MAJOR >= 3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can't use inline asm even on NVIDIA, then the equivalent of this change should be on host, because the #else path this triggers here requires local memory and we conditionally provide its allocation size from host (on NVIDIA, we currently don't).

@@ -97,7 +97,7 @@ ulong u64_shuffle(ulong v, uint thread_src, uint thread, __local ulong *buf)
// TODO: Test on other device types to add support
#if !gpu_nvidia(DEVICE_INFO) && !gpu_amd(DEVICE_INFO)
barrier(CLK_LOCAL_MEM_FENCE);
#elif gpu_amd(DEVICE_INFO) && DEV_VER_MAJOR < 2500
#elif !__OS_X__ && gpu_amd(DEVICE_INFO) && DEV_VER_MAJOR < 2500
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am surprised the asm here was reached for you - we only reach it on ancient AMD driver. Maybe the real issue is we somehow fail to set DEV_VER_MAJOR on macOS? Anyway, I don't mind this change.

@solardiz
Copy link
Member

solardiz commented Jan 8, 2024

Thank you for testing this on macOS, and I'm happy to see you're active with the project again.

The Apple driver doesn't let us use inline assembler.

Maybe that's a property of the AMD driver? e.g. AMDGPU-Pro on super doesn't let us use inline asm either (but manages fine without a compiler memory barrier), which is why we have the version check here.

There are other problems I haven't looked into yet, with other devices:

That's interesting. This format is known to fail on CPU devices and Intel HD Graphics, but for me it fails differently - with FAILED (cmp_one(1)), see #5417.

FAILED (cmp_one(10))

That's the very last test vector, meaning the code almost worked for you on HD Graphics.

@magnumripper
Copy link
Member Author

The Apple drivers never seemed to share a single line of code with Windows/Linux. I've only used macOS with older nvidia but I'm sure it never had inline asm. Also, none of the proprietary other things ever worked, such as temp/fan sensors, querying compute capability (unless CUDA), cl_amd_media_ops, nothing of the sorts. Also there were never any similarity whatsoever in version numbers. The driver version number is just 1.2 (for OpenCL 1.2) or sometimes 1.1, and a date. This means things like DEV_VER_MAJOR < 2500 will always match...

Anyway I'll have a look at that host side thing, and perhaps also see if I can find some old nvidia macbook, just because why not. And a look-see at the Intel UHD Grahpics thing, in case it's close to fully working.

@solardiz
Copy link
Member

solardiz commented Jan 9, 2024

FWIW, I just noticed this in hashcat, could also be relevant to getting this format fully working on macOS:

            // work around, for some reason apple opencl can't have buffers larger 2^31
            // typically runs into trap 6
            // maybe 32/64 bit problem affecting size_t?
            // this seems to affect global memory as well no just single allocations
            if ((device_param->opencl_platform_vendor_id == VENDOR_ID_APPLE) && (device_param->is_metal == false))
            {
              const size_t undocumented_single_allocation_apple = 0x7fffffff;
              if (((c + 1 + 1) * MAX_ALLOC_CHECKS_SIZE) >= undocumented_single_allocation_apple) break;
            }

@magnumripper Perhaps you can try getting it to allocate more than 2 GiB on your laptop and see what happens? For this experiment, I'd try increasing the 6 here in opencl_argon2_fmt_plug.c:

        // Use almost all GPU memory by default
        unsigned int warps = 6, limit, target;

or do a real cracking run with a hash requiring more memory (e.g., one of our test vectors that use 16 MiB, or you can also increase the value of m in there - no need for the hash to remain crackable).

@solardiz solardiz merged commit 3b26d3b into openwall:bleeding-jumbo Apr 6, 2024
31 of 32 checks passed
@solardiz
Copy link
Member

solardiz commented Apr 6, 2024

This was sitting here for 3 months with no progress, so I went ahead and merged it as-is, then made my own follow-up commit c042fa3 to fix the NVIDIA host part (untested). Consistent with how we do it elsewhere in the codebase, on host I check for __APPLE__. I notice in OpenCL we check for __OS_X__. I hope this is right, and will match on the same systems, @magnumripper? Is there a good way for us to make this more reliable in case of future changes? For example, if Apple releases another OS where it sets __APPLE__, but does not set __OS_X__ in OpenCL? BTW, I have no idea what macros they set in iOS, which we don't currently support here.

I did not try addressing the 2G single allocation limit mentioned above. I think this needs testing first.

@claudioandre-br
Copy link
Member

I notice in OpenCL we check for OS_X. I hope this is right, and will match

This is something we created and control at:

#ifdef __APPLE__

Plus:

"-D__OS_X__ ",

@magnumripper magnumripper deleted the jan2024 branch July 17, 2024 22:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants