
ascii.rs optimization since ptr is aligned #193

Open
danakj opened this issue Oct 23, 2024 · 1 comment

danakj commented Oct 23, 2024

On line 186, there is a call to _mm_loadu_si128:

let chunk = _mm_loadu_si128(ptr as *const __m128i);

That call does not require the pointer to be aligned at all. But it could be replaced with _mm_load_si128, which requires alignment to a 16-byte boundary.

On line 139, the ptr is aligned using the VECTOR_ALIGN mask, which aligns it to a multiple of size_of::<__m128i>(), that is, to 16 bytes.

bstr/src/ascii.rs

Lines 120 to 121 in 3dc5939

const VECTOR_SIZE: usize = core::mem::size_of::<__m128i>();
const VECTOR_ALIGN: usize = VECTOR_SIZE - 1;

https://github.com/BurntSushi/bstr/blob/3dc5939f30daa1a8a6e5cc346bb77841f19ea415/src/ascii.rs#L139C9-L139C12
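As an illustration (a sketch only, not the actual bstr code; the `align_up` helper and example addresses are hypothetical), rounding an address up with a VECTOR_ALIGN-style mask produces a 16-byte-aligned value:

```rust
// Sketch of align-up-by-mask. The constants mirror the ones quoted above;
// `align_up` itself is illustrative, not bstr's API.
const VECTOR_SIZE: usize = 16; // size_of::<__m128i>() on x86-64
const VECTOR_ALIGN: usize = VECTOR_SIZE - 1;

fn align_up(addr: usize) -> usize {
    // Adding 15 and clearing the low four bits rounds up to the next
    // multiple of 16.
    (addr + VECTOR_ALIGN) & !VECTOR_ALIGN
}

fn main() {
    assert_eq!(align_up(0x1000), 0x1000); // already aligned
    assert_eq!(align_up(0x1007), 0x1010); // rounded up to the next boundary
}
```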

Inside the loop, the ptr is always advanced by VECTOR_LOOP_SIZE, which is a multiple of 16 bytes:

bstr/src/ascii.rs

Lines 120 to 122 in 3dc5939

const VECTOR_SIZE: usize = core::mem::size_of::<__m128i>();
const VECTOR_ALIGN: usize = VECTOR_SIZE - 1;
const VECTOR_LOOP_SIZE: usize = 4 * VECTOR_SIZE;

https://github.com/BurntSushi/bstr/blob/3dc5939f30daa1a8a6e5cc346bb77841f19ea415/src/ascii.rs#L180C36-L180C52

And then it is further advanced by VECTOR_SIZE, which is 16 bytes:

const VECTOR_SIZE: usize = core::mem::size_of::<__m128i>();

ptr = ptr.add(VECTOR_SIZE);
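To make the alignment argument concrete, here is a minimal sketch of that loop shape (the `walk` function and its parameter names are illustrative, not the actual bstr code): once ptr is 16-byte aligned, every advance is by a multiple of 16, so it stays aligned.

```rust
const VECTOR_SIZE: usize = 16;
const VECTOR_LOOP_SIZE: usize = 4 * VECTOR_SIZE;

// Sketch only: assumes the slice is at least VECTOR_LOOP_SIZE bytes long,
// which the real code checks before entering the SIMD path.
unsafe fn walk(mut ptr: *const u8, end_ptr: *const u8) {
    while ptr <= end_ptr.sub(VECTOR_LOOP_SIZE) {
        // ... four 16-byte chunks processed here ...
        ptr = ptr.add(VECTOR_LOOP_SIZE); // 64 is a multiple of 16: still aligned
    }
    while ptr <= end_ptr.sub(VECTOR_SIZE) {
        // ... one 16-byte chunk: this is where the load at line 186 happens ...
        ptr = ptr.add(VECTOR_SIZE); // still aligned
    }
    // fewer than 16 bytes remain; the real code finishes byte-at-a-time
}
```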

So in the loop at line 186, the pointer is always aligned to 16 bytes:

let chunk = _mm_loadu_si128(ptr as *const __m128i);
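For concreteness, here is a minimal sketch of the substitution this issue suggests (the `load_aligned_chunk` wrapper is hypothetical; the point is only the `_mm_load_si128` intrinsic):

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{__m128i, _mm_load_si128};

/// Hypothetical wrapper, not bstr's API: load one 16-byte chunk with the
/// aligned intrinsic. Safety: the caller must guarantee `ptr` is valid for
/// 16 bytes of reads and is 16-byte aligned, which is what the analysis
/// above establishes for the loop at line 186.
#[cfg(target_arch = "x86_64")]
unsafe fn load_aligned_chunk(ptr: *const u8) -> __m128i {
    debug_assert_eq!(ptr as usize % 16, 0);
    _mm_load_si128(ptr as *const __m128i)
}
```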

The aligned version of the function was pointed out by @anforowicz during an unsafe code audit: https://chromium-review.googlesource.com/c/chromium/src/+/5925797/comment/f08dc00c_1b24061c/

BurntSushi (Owner) commented

Hmmm I think actually there's probably a bug here. We probably should be making use of unaligned loads here to avoid the slow byte-at-a-time loop. Like this:

https://github.com/BurntSushi/memchr/blob/a26dd5b590e0a0ddcce454fb1ac02f5586e50952/src/arch/generic/memchr.rs#L219-L228

(Although that interestingly does use an unaligned load just before when it could be an aligned load.)
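For reference, a minimal sketch of the technique in the linked memchr code (names are illustrative; this is not memchr's or bstr's actual code): instead of finishing the tail byte-at-a-time, back the pointer up so that one final unaligned 16-byte load ends exactly at the end of the buffer, re-reading a few already-scanned bytes.

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{__m128i, _mm_loadu_si128};

const VECTOR_SIZE: usize = 16;

/// Illustrative only. Assumes the haystack is at least VECTOR_SIZE bytes
/// (the SIMD path is only taken in that case) and that some tail bytes
/// before `end_ptr` are still unscanned.
#[cfg(target_arch = "x86_64")]
unsafe fn overlapping_tail_load(end_ptr: *const u8) -> __m128i {
    // Start the load VECTOR_SIZE bytes before the end, so it may overlap
    // bytes the main loop already inspected; for a scan such as "find the
    // first non-ASCII byte" re-checking them is harmless.
    _mm_loadu_si128(end_ptr.sub(VECTOR_SIZE) as *const __m128i)
}
```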

Aligned versus unaligned rarely makes a difference on x86-64 in my experience anyway, and especially here where this is just the "cleanup" aspect of the implementation.

But good catch. I agree that, as the code is currently written, it can safely be an aligned load.
