avx

AVX2 byte gather with uint16 indices, into a __m256i

冷暖自知 submitted on 2021-02-07 13:30:20
Question: I am trying to pack a __m256i variable with 32 chars from an array, selected by indices. Here is my code: char array[]; // different array every time. uint16_t offset[32]; // same offsets reused many times _mm256_set_epi8(array[offset[0]], array[offset[1]], array[offset[2]], array[offset[3]], array[offset[4]], array[offset[5]], array[offset[6]], array[offset[7]], array[offset[8]], array[offset[9]], array[offset[10]], array[offset[11]], array[offset[12]], array[offset[13]], array[offset[14]],

AVX2 byte gather with uint16 indices, into a __m256i

走远了吗. submitted on 2021-02-07 13:28:26
Question: I am trying to pack a __m256i variable with 32 chars from an array, selected by indices. Here is my code: char array[]; // different array every time. uint16_t offset[32]; // same offsets reused many times _mm256_set_epi8(array[offset[0]], array[offset[1]], array[offset[2]], array[offset[3]], array[offset[4]], array[offset[5]], array[offset[6]], array[offset[7]], array[offset[8]], array[offset[9]], array[offset[10]], array[offset[11]], array[offset[12]], array[offset[13]], array[offset[14]],

AVX/SSE round floats down and return vector of ints?

拟墨画扇 submitted on 2021-02-07 08:20:53
Question: Is there a way, using AVX/SSE, to take a vector of floats, round down, and produce a vector of ints? All the floor intrinsics seem to produce a vector of floating-point values, which is odd because rounding produces an integer! Answer 1: SSE has conversion from FP to integer with your choice of truncation (towards zero) or the current rounding mode (normally the IEEE default mode: nearest, with tiebreaks rounding to even, like nearbyint() and unlike round(), where the tiebreak is away from 0). If
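
The two conversions the answer mentions can be sketched as below. This is a minimal illustration, not the answer's own code (the excerpt is cut off before any appears). It assumes an x86-64 target where SSE2 is available and falls back to scalar code elsewhere. The helper names `floor_to_int4` and `nearest_to_int4` are hypothetical. Floor is built from the truncating conversion by subtracting 1 wherever truncation rounded up, which happens for negative non-integers.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

// Hypothetical helper: floor 4 floats and store them as 32-bit ints.
// Inputs are assumed to be finite and to fit in int32_t.
void floor_to_int4(const float* in, int32_t* out) {
    assert(in && out);
#ifdef HAVE_SSE2
    __m128  v    = _mm_loadu_ps(in);
    __m128i t    = _mm_cvttps_epi32(v);   // truncate toward zero
    __m128  tf   = _mm_cvtepi32_ps(t);    // back to float for the comparison
    __m128  mask = _mm_cmpgt_ps(tf, v);   // all-ones where truncation rounded up
    // An all-ones lane is -1 as an integer, so adding the mask subtracts 1.
    __m128i r    = _mm_add_epi32(t, _mm_castps_si128(mask));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), r);
#else
    for (int i = 0; i < 4; ++i) out[i] = static_cast<int32_t>(std::floor(in[i]));
#endif
}

// Hypothetical helper: convert 4 floats using the current rounding mode
// (round-to-nearest-even unless the program changed it).
void nearest_to_int4(const float* in, int32_t* out) {
    assert(in && out);
#ifdef HAVE_SSE2
    __m128i r = _mm_cvtps_epi32(_mm_loadu_ps(in));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), r);
#else
    for (int i = 0; i < 4; ++i) out[i] = static_cast<int32_t>(std::nearbyint(in[i]));
#endif
}
```

With SSE4.1 available, the floor step could instead be done in floating point first and then converted, but the SSE2-only form above keeps the sketch buildable on any x86-64 compiler without extra flags.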

AVX/SSE round floats down and return vector of ints?

▼魔方 西西 submitted on 2021-02-07 08:17:27
Question: Is there a way, using AVX/SSE, to take a vector of floats, round down, and produce a vector of ints? All the floor intrinsics seem to produce a vector of floating-point values, which is odd because rounding produces an integer! Answer 1: SSE has conversion from FP to integer with your choice of truncation (towards zero) or the current rounding mode (normally the IEEE default mode: nearest, with tiebreaks rounding to even, like nearbyint() and unlike round(), where the tiebreak is away from 0). If

How to make premultiplied alpha function faster using SIMD instructions?

↘锁芯ラ submitted on 2021-02-07 06:38:12
Question: I'm looking for some SSE/AVX advice to optimize a routine that premultiplies the RGB channels by the alpha channel: RGB * alpha / 255 (keeping the original alpha channel). for (int i = 0, max = width * height * 4; i < max; i+=4) { data[i] = static_cast<uint16_t>(data[i] * data[i+3]) / 255; data[i+1] = static_cast<uint16_t>(data[i+1] * data[i+3]) / 255; data[i+2] = static_cast<uint16_t>(data[i+2] * data[i+3]) / 255; } You will find below my current implementation but I think it could be much
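
As a point of reference for the loop in the question, here is a scalar sketch that replaces the division by 255 with adds and shifts. It is not the asker's implementation, which is cut off in this excerpt, and the names `div255` and `premultiply_rgba` are hypothetical. The identity is exact for products of two 8-bit values, and every intermediate stays below 65536, so the same arithmetic would also fit in 16-bit SIMD lanes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Exact floor(x / 255) for 0 <= x <= 65279, using only adds and shifts.
// The product of two bytes is at most 255 * 255 = 65025, so it qualifies.
inline uint32_t div255(uint32_t x) {
    assert(x <= 65279u);
    return (x + 1 + (x >> 8)) >> 8;
}

// Premultiply RGBA8 pixels in place: each colour channel becomes
// channel * alpha / 255 (truncating, as in the question); alpha is kept.
void premultiply_rgba(uint8_t* data, std::size_t pixel_count) {
    assert(data || pixel_count == 0);
    for (std::size_t p = 0; p < pixel_count; ++p) {
        uint8_t* px = data + 4 * p;
        const uint32_t a = px[3];
        px[0] = static_cast<uint8_t>(div255(px[0] * a));
        px[1] = static_cast<uint8_t>(div255(px[1] * a));
        px[2] = static_cast<uint8_t>(div255(px[2] * a));
    }
}
```

A vectorised version would typically widen the bytes to 16 bits, multiply, and apply the same shift-and-add step per lane; that part is left out here because the excerpt does not show which approach the asker took.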

Disabling AVX2 in CPU for testing purposes

*爱你&永不变心* submitted on 2021-02-07 05:40:42
Question: I've got an application that requires AVX2 to work correctly. A check was implemented at application start to verify that the CPU supports AVX2 instructions. I would like to confirm that the check works correctly, but I only have a CPU that supports AVX2. Is there a way to temporarily turn it off for testing purposes, or to somehow emulate another CPU? Answer 1: Yes, use an "emulation" (or dynamic recompilation) layer like Intel's Software Development Emulator (SDE), or maybe QEMU. SDE is closed-source freeware, and very handy

AVX2 integer multiply of signed 8-bit elements, producing signed 16-bit results?

隐身守侯 submitted on 2021-02-07 03:44:08
Question: I have two __m256i vectors, each filled with 32 8-bit integers. Something like this: __int8 *a0 = new __int8[32] {2}; __int8 *a1 = new __int8[32] {3}; __m256i v0 = _mm256_loadu_si256((__m256i*)a0); __m256i v1 = _mm256_loadu_si256((__m256i*)a1); How can I multiply these vectors, using something like _mm256_mul_epi8(v0, v1) (which does not exist) or any other way? I want 2 vectors of results, because the output element width is twice the input element width. Or something that works similarly to
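
One way to get widening products is to sign-extend each half of the inputs to 16 bits and then use the 16-bit low multiply. The sketch below illustrates that idea and is not taken from the answers, which this excerpt does not include. It assumes GCC or Clang on x86-64 (for the `target` attribute and `__builtin_cpu_supports`) and falls back to a scalar loop otherwise. The function names are hypothetical. The 32 output elements correspond to the two result vectors the question asks for, in the same order as the inputs.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define CAN_USE_AVX2 1

// AVX2 path: widen 16 bytes at a time to 16-bit lanes, then multiply.
// An int8 * int8 product always fits in int16, so the low 16 bits are exact.
__attribute__((target("avx2")))
static void mul_i8_to_i16_avx2(const int8_t* a, const int8_t* b, int16_t* out) {
    for (int half = 0; half < 2; ++half) {
        __m128i a8 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + 16 * half));
        __m128i b8 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + 16 * half));
        __m256i a16 = _mm256_cvtepi8_epi16(a8);   // sign-extend 16 x int8 -> int16
        __m256i b16 = _mm256_cvtepi8_epi16(b8);
        __m256i prod = _mm256_mullo_epi16(a16, b16);
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + 16 * half), prod);
    }
}
#endif

// Portable reference used when AVX2 is unavailable.
static void mul_i8_to_i16_scalar(const int8_t* a, const int8_t* b, int16_t* out) {
    for (int i = 0; i < 32; ++i) out[i] = static_cast<int16_t>(a[i] * b[i]);
}

// Multiply 32 signed bytes pairwise into 32 signed 16-bit results.
void mul_i8_to_i16(const int8_t* a, const int8_t* b, int16_t* out) {
    assert(a && b && out);
#ifdef CAN_USE_AVX2
    if (__builtin_cpu_supports("avx2")) {
        mul_i8_to_i16_avx2(a, b, out);
        return;
    }
#endif
    mul_i8_to_i16_scalar(a, b, out);
}
```

Loading 128-bit halves before widening keeps the results in source order; widening from an already-loaded 256-bit vector would need an extra extract of the upper half.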

AVX2 integer multiply of signed 8-bit elements, producing signed 16-bit results?

空扰寡人 submitted on 2021-02-07 03:43:17
Question: I have two __m256i vectors, each filled with 32 8-bit integers. Something like this: __int8 *a0 = new __int8[32] {2}; __int8 *a1 = new __int8[32] {3}; __m256i v0 = _mm256_loadu_si256((__m256i*)a0); __m256i v1 = _mm256_loadu_si256((__m256i*)a1); How can I multiply these vectors, using something like _mm256_mul_epi8(v0, v1) (which does not exist) or any other way? I want 2 vectors of results, because the output element width is twice the input element width. Or something that works similarly to

I've some problems understanding how AVX shuffle intrinsics are working for 8 bits

倾然丶 夕夏残阳落幕 submitted on 2021-02-05 11:51:07
Question: I'm trying to pack 16-bit data to 8 bits using _mm256_shuffle_epi8, but the result is not what I'm expecting. auto srcData = _mm256_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32); __m256i vperm = _mm256_setr_epi8(0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1); auto result = _mm256_shuffle_epi8(srcData, vperm); I'm expecting
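
The surprise here most likely comes from `_mm256_shuffle_epi8` operating on each 128-bit lane independently: only the low 4 bits of a control byte select a source byte, always from the same lane, and a set high bit zeroes the output byte. The scalar model below is a sketch of that documented behaviour, written in plain C++ so it runs without AVX2. The names `shuffle_epi8_model` and `question_example` are hypothetical, and the asker's expected output is cut off in this excerpt, so it is not reproduced.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Bytes32 = std::array<uint8_t, 32>;
using Ctrl32  = std::array<int8_t, 32>;

// Scalar model of _mm256_shuffle_epi8: two independent 16-byte shuffles.
Bytes32 shuffle_epi8_model(const Bytes32& src, const Ctrl32& ctrl) {
    Bytes32 r{};
    for (int i = 0; i < 32; ++i) {
        const int lane_base = (i / 16) * 16;   // 0 for the low lane, 16 for the high lane
        if (ctrl[i] & 0x80) {
            r[i] = 0;                          // high bit set: output byte is zeroed
        } else {
            r[i] = src[lane_base + (ctrl[i] & 0x0F)];  // index wraps within the lane
        }
    }
    return r;
}

// The inputs from the question: bytes 1..32 and the even-index control mask.
Bytes32 question_example() {
    Bytes32 src{};
    Ctrl32 ctrl{};
    for (int i = 0; i < 32; ++i) {
        src[i] = static_cast<uint8_t>(i + 1);
        ctrl[i] = (i < 16) ? static_cast<int8_t>(2 * i) : static_cast<int8_t>(-1);
    }
    return shuffle_epi8_model(src, ctrl);
}
```

Under this model, control values 16 to 30 in the low lane wrap around to 0 to 14, so the low lane holds 1, 3, ..., 15 twice and the high lane is all zeros. Gathering bytes from both lanes into one usually takes an in-lane shuffle followed by a cross-lane permute such as `_mm256_permute4x64_epi64`.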

I've some problems understanding how AVX shuffle intrinsics are working for 8 bits

烈酒焚心 submitted on 2021-02-05 11:48:05
Question: I'm trying to pack 16-bit data to 8 bits using _mm256_shuffle_epi8, but the result is not what I'm expecting. auto srcData = _mm256_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32); __m256i vperm = _mm256_setr_epi8(0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1); auto result = _mm256_shuffle_epi8(srcData, vperm); I'm expecting