MSVC2015 update2 compiles the C code:
abs_residual_partition_sums[partition] =
(FLAC__uint32)_mm_cvtsi128_si32(mm_sum);
into this:
movq QWORD PTR [rsi], xmm2
while it should be:
movd eax, xmm2
mov QWORD PTR [rsi], rax
With this patch, MSVC emits:
movq QWORD PTR [rsi], xmm2
mov DWORD PTR [rsi+4], r9d
so the price of this workaround is 1 extra write instruction per
partition.
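The three store sequences above can be modelled in portable C to make the
difference explicit. This is only a sketch: the type and function names are
hypothetical, the low 64 bits of the register are represented as two 32-bit
lanes, and it assumes that r9d holds zero when the patched code runs, which
the listing above does not state.

```c
#include <stdint.h>

/* Hypothetical model of the low 64 bits of an XMM register as two lanes. */
typedef struct { uint32_t lane0, lane1; } xmm_low64;

/* movq QWORD PTR [rsi], xmm2
 * Copies both lanes, so lane1 leaks into the upper half of the 64-bit slot. */
static uint64_t store_miscompiled(xmm_low64 r)
{
    return (uint64_t)r.lane0 | ((uint64_t)r.lane1 << 32);
}

/* movd eax, xmm2 ; mov QWORD PTR [rsi], rax
 * Writing eax zero-extends into rax, so only lane0 reaches the slot. */
static uint64_t store_expected(xmm_low64 r)
{
    return (uint64_t)r.lane0;
}

/* movq QWORD PTR [rsi], xmm2 ; mov DWORD PTR [rsi+4], r9d
 * The second store overwrites the upper half of the slot with r9d
 * (assumed here to be zero). */
static uint64_t store_patched(xmm_low64 r, uint32_t r9d)
{
    uint64_t slot = store_miscompiled(r);
    slot = (slot & 0xFFFFFFFFu) | ((uint64_t)r9d << 32);
    return slot;
}
```

With lane1 non-zero, store_miscompiled() differs from store_expected(), while
store_patched(r, 0) gives the same value as store_expected(r).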
Patch-from: lvqcl <lvqcl.mail@gmail.com>
In the precompute_partition_info_sums_ function, instead of selecting the
64-bit accumulator whenever the signal bps is larger than 16, revert to
the original approach based on partition size, but leave room for a few
extra bits so the sum does not overflow with unusual signals where the
average residual magnitude may exceed bps bits.
This slightly improves performance with standard encoding levels and
16-bit files, as the 17-bit side channel can still be processed with the
32-bit accumulator, while the 64-bit accumulator is still correctly
selected for very large 16-bit partitions.
This is related to commits 6f7ec60c and 187e596e.
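The size-based selection described above can be sketched roughly as follows.
The helper names and the headroom constant EXTRA_RESIDUAL_BITS (set to 4
here) are assumptions made for illustration, not values taken from this
message. The idea: a sum of N residuals, each below 2^(bps + extra), stays
below 2^(ilog2(N) + 1 + bps + extra), so it fits in 32 bits when
ilog2(N) + bps + extra < 32.

```c
#include <stdint.h>

/* Hypothetical headroom for residuals larger than bps bits (assumed value). */
#define EXTRA_RESIDUAL_BITS 4u

/* floor(log2(v)) for v > 0. */
static unsigned ilog2_u32(uint32_t v)
{
    unsigned n = 0;
    while (v >>= 1)
        n++;
    return n;
}

/* Returns 1 when a partition's sum of absolute residuals is assumed to fit
 * in a 32-bit accumulator, 0 when the 64-bit accumulator should be used. */
static int fits_32bit_accumulator(uint32_t partition_samples, unsigned bps)
{
    return ilog2_u32(partition_samples) + bps + EXTRA_RESIDUAL_BITS < 32;
}
```

Under these assumptions a 17-bit side channel with 256-sample partitions
still uses the 32-bit path (8 + 17 + 4 = 29), while a 16-bit signal with
4096-sample partitions falls back to 64 bits (12 + 16 + 4 = 32).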
Signed-off-by: Erik de Castro Lopo <erikd@mega-nerd.com>
Most non-static functions have the FLAC__ prefix, but it was missing
from the precompute_partition_info_sums_* functions.
Patch-from: lvqcl <lvqcl.mail@gmail.com>
* Split lpc_x86intrin.c into lpc_intrin_sse.c and lpc_intrin_sse2.c
* Add FLAC__lpc_compute_residual_from_qlp_coefficients_intrin_sse2()
function to lpc_intrin_sse2.c
* Add lpc_intrin_sse41.c with two ..._wide_intrin_sse41() functions
(useful for 24-bit en-/decoding)
* Add precompute_partition_info_sums_intrin_sse2() / ...ssse3() and
disable precompute_partition_info_sums_32bit_asm_ia32_().
The SSE2 version uses 4 SSE2 instructions in place of the single SSSE3
instruction PABSD, so it is slightly slower.
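For illustration, one common way to take an absolute value without PABSD is
the sign-mask idiom, shown here in scalar C for a single 32-bit lane. Whether
the SSE2 function uses exactly this sequence is an assumption, and the
function names below are hypothetical. In SIMD form the mask would typically
come from an arithmetic right shift by 31, followed by an XOR and a subtract.

```c
#include <stddef.h>
#include <stdint.h>

/* |x| as an unsigned 32-bit value, computed with a sign mask.
 * Like PABSD, INT32_MIN maps to 0x80000000. */
static uint32_t abs32_mask(int32_t x)
{
    uint32_t ux = (uint32_t)x;
    uint32_t mask = (x < 0) ? 0xFFFFFFFFu : 0u; /* all ones for negative x */
    return (ux ^ mask) - mask;                  /* two's complement negate if negative */
}

/* Sum of absolute residuals for one partition, using a 64-bit accumulator. */
static uint64_t sum_abs_residuals(const int32_t *residual, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += abs32_mask(residual[i]);
    return sum;
}
```

The extra mask, XOR and subtract steps per vector are the kind of overhead
that a single PABSD avoids.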
Patch-from: lvqcl <lvqcl.mail@gmail.com>