Added StreamLoad op #2044

Open
wants to merge 1 commit into
base: master
from

Conversation

johnplatts
Contributor

Added the StreamLoad op: SSE4/AVX2/AVX3/PPC have non-temporal aligned load instructions for vectors of 16 bytes or larger, and SVE has non-temporal load instructions for all vector sizes.

@jan-wassenberg
Member

I'm concerned about performance and correctness on x86. _mm_stream_load_si128 is super slow (hundreds of cycles) and only really intended for WC memory, i.e. memory-mapped I/O. It does seem useful for drivers that actually do want to bulk-load from WC: https://community.intel.com/t5/Intel-ISA-Extensions/Do-Non-Temporal-Loads-Prefetch/m-p/1027104
Is that the intended use case?

If so, then we also have errata HSD162, BDM116 and SKL079 to deal with, concerning ordering with respect to LOCK and MFENCE. Possibly we can just document that.

If the hope is rather that, when we load from normal WB memory, the cache line is marked as preferred for discarding, do we have evidence of a benefit? The past few times I've tried this and similar things, I was disappointed.

Possible options: rely on prefetches to set the hint we'd like before the actual load, and/or make the x86 StreamLoad equivalent to Load if you'd still like to target the SVE instruction. What do you think?
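
To make the two options concrete, here is a minimal sketch of what they might look like together on x86. It is not code from this PR or from Highway: `Vec128`, `LoadAligned`, `StreamLoadSketch` and `SumBytesStreaming` are hypothetical names chosen for illustration. The sketch assumes SSE2 is available and otherwise falls back to a plain copy. The "stream" load is an ordinary aligned load preceded by a non-temporal prefetch hint. Real code would likely issue the prefetch further ahead of the load, and whether the hint helps at all on WB memory would need measuring.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 loads/stores; also provides _mm_prefetch
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

// Hypothetical 16-byte vector wrapper, for illustration only.
struct Vec128 {
  alignas(16) std::uint8_t bytes[16];
};

// Ordinary aligned load. `aligned` must be 16-byte aligned.
inline Vec128 LoadAligned(const std::uint8_t* aligned) {
  Vec128 v;
#if SKETCH_HAVE_SSE2
  const __m128i raw =
      _mm_load_si128(reinterpret_cast<const __m128i*>(aligned));
  _mm_store_si128(reinterpret_cast<__m128i*>(v.bytes), raw);
#else
  std::memcpy(v.bytes, aligned, sizeof(v.bytes));
#endif
  return v;
}

// Assumed design: StreamLoad == Load on x86, with the non-temporal intent
// expressed only as a prefetch hint (no _mm_stream_load_si128).
inline Vec128 StreamLoadSketch(const std::uint8_t* aligned) {
#if SKETCH_HAVE_SSE2
  _mm_prefetch(reinterpret_cast<const char*>(aligned), _MM_HINT_NTA);
#endif
  return LoadAligned(aligned);
}

// Example consumer: sums all bytes in whole 16-byte blocks of the input.
inline std::uint64_t SumBytesStreaming(const std::uint8_t* aligned,
                                       std::size_t num_bytes) {
  std::uint64_t sum = 0;
  for (std::size_t i = 0; i + 16 <= num_bytes; i += 16) {
    const Vec128 v = StreamLoadSketch(aligned + i);
    for (const std::uint8_t b : v.bytes) sum += b;
  }
  return sum;
}
```

With this shape the result is identical to a normal load on every target, so the errata around non-temporal loads do not come into play. A target with a cheap non-temporal load (for example SVE, per the PR description) could still specialize the same entry point.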

2 participants