A couple of lookup tables would be useful. To keep things simple, start with a floating-point -> floating-point lookup table. This is a new kernel type that might have the signature volk_32f_x2_s32f_lut_32f(float *output_vector, float *lut, float *input_vector, float max_input, unsigned int num_points).
This would also need a puppet that tests it on something like a cosine function.
AVX2 introduces a gather instruction that might make a LUT vectorizable (__m256 _mm256_i32gather_ps (float const* base_addr, __m256i vindex, const int scale)). It is currently unknown if vectorizing a LUT would have any advantage over a generic LUT.
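As a starting point for that experiment, here is a rough sketch of a gather-based lookup next to a scalar fallback. It is untuned and says nothing about which one is faster. The function names and the lut_size parameter are hypothetical, nearest-neighbor indexing with clamping is an assumption, and the target attribute and __builtin_cpu_supports are GCC/Clang extensions, so the AVX2 path is compiled only on x86 with those compilers.

```c
#include <assert.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <immintrin.h>
#define LUT_HAVE_AVX2_PATH 1
#endif

/* Hypothetical scalar reference: nearest neighbor, clamped to the table. */
static void lut_nearest_scalar(float *out, const float *lut, const float *in,
                               float max_input, unsigned int lut_size,
                               unsigned int num_points)
{
    assert(lut_size > 0 && lut_size <= (1u << 24) && max_input > 0.0f);
    const float last = (float)(lut_size - 1);
    const float scale = last / max_input;
    for (unsigned int i = 0; i < num_points; i++) {
        float pos = in[i] * scale + 0.5f;
        if (!(pos > 0.0f)) pos = 0.0f; /* negatives and NaN go to index 0 */
        if (pos > last) pos = last;
        out[i] = lut[(unsigned int)pos];
    }
}

#ifdef LUT_HAVE_AVX2_PATH
/* Hypothetical AVX2 variant: 8 indices per iteration, one gather each. */
__attribute__((target("avx2")))
static void lut_nearest_avx2(float *out, const float *lut, const float *in,
                             float max_input, unsigned int lut_size,
                             unsigned int num_points)
{
    assert(lut_size > 0 && lut_size <= (1u << 24) && max_input > 0.0f);
    const __m256 hi = _mm256_set1_ps((float)(lut_size - 1));
    const __m256 lo = _mm256_setzero_ps();
    const __m256 half = _mm256_set1_ps(0.5f);
    const __m256 scale = _mm256_set1_ps((float)(lut_size - 1) / max_input);
    unsigned int i = 0;
    for (; num_points - i >= 8; i += 8) {
        __m256 pos = _mm256_add_ps(
            _mm256_mul_ps(_mm256_loadu_ps(in + i), scale), half);
        /* max returns its second operand for NaN, so NaN maps to index 0 */
        pos = _mm256_min_ps(_mm256_max_ps(pos, lo), hi);
        __m256i idx = _mm256_cvttps_epi32(pos); /* truncate toward zero */
        /* scale = 4: indices count floats, addresses count bytes */
        _mm256_storeu_ps(out + i, _mm256_i32gather_ps(lut, idx, 4));
    }
    /* Fewer than 8 points left: finish with the scalar loop. */
    lut_nearest_scalar(out + i, lut, in + i, max_input, lut_size,
                       num_points - i);
}
#endif

/* Runtime dispatch: use the gather path only when the CPU reports AVX2. */
static void lut_nearest(float *out, const float *lut, const float *in,
                        float max_input, unsigned int lut_size,
                        unsigned int num_points)
{
#ifdef LUT_HAVE_AVX2_PATH
    if (__builtin_cpu_supports("avx2")) {
        lut_nearest_avx2(out, lut, in, max_input, lut_size, num_points);
        return;
    }
#endif
    lut_nearest_scalar(out, lut, in, max_input, lut_size, num_points);
}
```

Timing lut_nearest against lut_nearest_scalar on the same buffers, for tables that do and do not fit in cache, would be one way to answer the open question.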
Disclaimer: this is a fairly hefty project
What should a float->float LUT do? Return a value only if the key is bitwise identical? If the key is identical in terms of == (e.g. -0.0f == 0.0f)? Nearest neighbor? Linear interpolation?
I played around with the gather intrinsics, and they seem very useful; if I understood the throughput figures correctly, they are a real advantage. So that does sound interesting, but:
Personally, I think a float->float LUT might be most effective with nearest neighbor. In many systems, the granularity of the data is known, so linear interpolation would be overkill.
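For comparison, this is roughly the extra per-sample work that linear interpolation would add over a plain nearest-neighbor index: a second table read plus a multiply-add. It is only a sketch. The name lut_linear_generic and the lut_size parameter are hypothetical, and the uniform grid on [0, max_input] with clamping is an assumption.

```c
#include <assert.h>

/* Hypothetical linear-interpolation lookup (not part of VOLK).
 * Assumption: lut[k] holds f(max_input * k / (lut_size - 1)); inputs
 * outside [0, max_input] (including NaN) are clamped to the table ends. */
static void lut_linear_generic(float *output_vector, const float *lut,
                               const float *input_vector, float max_input,
                               unsigned int lut_size, unsigned int num_points)
{
    assert(lut_size > 1 && max_input > 0.0f);
    const float last = (float)(lut_size - 1);
    const float scale = last / max_input;
    for (unsigned int i = 0; i < num_points; i++) {
        float pos = input_vector[i] * scale;
        if (!(pos > 0.0f)) pos = 0.0f;
        if (pos > last) pos = last;
        unsigned int lo = (unsigned int)pos;       /* left neighbor */
        if (lo > lut_size - 2) lo = lut_size - 2;  /* keep lo + 1 in range */
        float frac = pos - (float)lo;              /* position between them */
        output_vector[i] = lut[lo] + frac * (lut[lo + 1] - lut[lo]);
    }
}
```

If the input data is already quantized to the table grid, frac is always 0 and this reduces to the nearest-neighbor result, which is the case where interpolation buys nothing.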
LUTs can be very useful, and the various gather intrinsics can be really useful too for various lookups. It would be interesting to see the speed difference in using a generic kernel versus one using the gather. Let's keep this issue around even if nobody will be getting to it any time soon, as a reminder to us that it would be interesting to investigate some day.